* [PATCH 00/16] DRBD: a block device for HA clusters
@ 2009-04-30 11:26 Philipp Reisner
2009-05-01 8:59 ` Andrew Morton
2009-05-03 5:53 ` Neil Brown
0 siblings, 2 replies; 44+ messages in thread
From: Philipp Reisner @ 2009-04-30 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
Lars Ellenberg, Philipp Reisner
Hi,
This is a repost of DRBD, to keep you updated about the ongoing
cleanups and improvements.
Patch set attached. Git tree available:
git pull git://git.drbd.org/linux-2.6-drbd.git drbd
We are looking for reviews!
Description
DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability
clusters and in this context, is a "drop-in" replacement for shared
storage. Simplistically, you could see it as a network RAID 1.
Although I use the "RAID1+NBD" metaphor myself, recent discussion
revealed that one needs to understand the differences as well.
Here are just two examples of that:
1) Think of a two node HA cluster. Node A is active ('primary' in DRBD
speak) has the filesystem mounted and the application running. Node B is
in standby mode ('secondary' in DRBD speak).
We loose network connectivity, the primary node continues to run, the
secondary no longer gets updates.
Then we have a complete power failure, both nodes are down. Then they
power up the data center again, but at first the get only the power
circuit of node B up and running again.
Should node B offer the service right now ?
( DRBD has configurable policies for that )
Later on they manage to get node A up and running again; now let's assume
node B was chosen to be the new primary node. What needs to be done ?
Modifications on B since it became primary need to be resynced to A.
Modifications on A since it lost contact with B need to be taken out.
DRBD does that.
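The reconnect decision in this scenario can be sketched as a toy model
(purely illustrative; the names and the decision rule are simplified
stand-ins, not DRBD's actual generation-identifier algorithm): each node
keeps a history of data generation identifiers plus a dirty bitmap, and
on reconnect the histories determine the sync direction.

```python
# Illustrative sketch only, NOT DRBD's real algorithm: decide the resync
# direction after a partition from per-node generation histories and the
# dirty bitmaps accumulated while disconnected.

def plan_resync(gen_a, gen_b, dirty_a, dirty_b):
    """gen_*: list of generation IDs, newest first.
    dirty_*: sets of block numbers modified while disconnected."""
    if gen_a[0] == gen_b[0]:
        return "in sync", set()
    if gen_b[0] in gen_a:           # A's history contains B's newest: A is ahead
        return "sync A -> B", dirty_a
    if gen_a[0] in gen_b:           # B is ahead: B's changes go to A, and the
        # blocks A changed on its own must be overwritten ("taken out") too,
        # so the union of both bitmaps is resynced from B.
        return "sync B -> A", dirty_b | dirty_a
    return "split brain", None      # diverged histories: needs a policy decision
```

This also shows why merged bitmaps matter: in the "sync B -> A" case, A's
own modifications must be resynced from B as well, not just B's new writes.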
How do you fit that into a RAID1+NBD model ? NBD is just a block
transport, it does not offer the ability to exchange dirty bitmaps or
data generation identifiers, nor does the RAID1 code has a concept of
that.
2) When using DRBD over low-bandwidth links and a resync has to be run,
DRBD offers the option to do a "checksum based resync". Similar to rsync,
it at first only exchanges a checksum, and transmits the whole data
block only if the checksums differ.
That again is something that does not fit into the concepts of NBD or RAID1.
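The checksum-based resync idea can be sketched like this (an illustrative
model, not DRBD's wire protocol; `read_block`/`write_block` are
hypothetical helpers standing in for block I/O on each side):

```python
# Sketch of a checksum-based resync: exchange a strong digest per block
# first, and ship the full block contents only when the digests differ.
import hashlib

def checksum_resync(source, target, blocks):
    """source/target: objects with read_block(n) and write_block(n, data).
    Returns the number of blocks actually transferred."""
    transferred = 0
    for n in blocks:
        data = source.read_block(n)
        # Cheap step: compare 32-byte digests instead of shipping the block.
        if hashlib.sha256(data).digest() != \
                hashlib.sha256(target.read_block(n)).digest():
            # Expensive step: transmit the whole block only on mismatch.
            target.write_block(n, data)
            transferred += 1
    return transferred
```

On a slow link, the digest exchange costs a few dozen bytes per block,
while a transfer costs the full block size, hence the rsync-like win when
most blocks are already identical.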
DRBD can also be used in dual-Primary mode (device writable on both
nodes), which means it can exhibit shared disk semantics in a
shared-nothing cluster. Needless to say, on top of dual-Primary
DRBD, utilizing a cluster file system is necessary to maintain
cache coherency.
More background on this can be found in this paper:
http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
Beyond that, DRBD addresses various issues of cluster partitioning,
which the MD/NBD stack, to the best of our knowledge, does not
solve. The above-mentioned paper goes into some detail about that as
well.
DRBD can operate in synchronous mode or in asynchronous mode. I want
to point out that we guarantee never to violate a single possible
write-after-write dependency when writing on the standby node. More on that
can be found in this paper:
http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf
Last but not least, DRBD offers background resynchronisation and keeps
an on-disk representation of the dirty bitmap up to date. A reasonable
tradeoff between the number of updates and resyncing more than needed
is implemented with the activity log.
More on that:
http://www.drbd.org/fileadmin/drbd/publications/drbd-activity-logging_v6.pdf
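The activity-log tradeoff can be modeled as a bounded LRU of "active"
extents: a write inside an already-active extent costs no metadata
update, while activating a new extent costs one transactional on-disk
update, and after a primary crash only the extents in the log need a full
resync. (A sketch with made-up sizes and policy, not DRBD's
implementation.)

```python
# Toy model of an activity log: bounded LRU set of active extents.
from collections import OrderedDict

class ActivityLog:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.extents = OrderedDict()   # extent number -> True, in LRU order
        self.metadata_updates = 0      # counts (simulated) on-disk AL writes

    def write(self, extent):
        if extent in self.extents:
            self.extents.move_to_end(extent)   # hot extent: no metadata write
            return
        if len(self.extents) >= self.capacity:
            self.extents.popitem(last=False)   # evict least recently used
        self.extents[extent] = True
        self.metadata_updates += 1             # one transactional AL update
```

A larger log means fewer metadata updates for scattered writes, but more
data to resync after a crash, which is exactly the tradeoff described
above.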
Changes since 2009-04-10
* Cleanup: Removed all CamelCase
* Cleanup: Replaced DRBD's own tracing stuff with regular tracepoints
* Cleanup: Removed ERR/INFO/ALERT ... macros, using dev_err/dev_info/... now
* Cleanup: Minor stuff, as suggested in feedback on LKML
* DRBD: Bitmap compression feature was finalised
* DRBD: new disable_sendpage parameter
Changes since the post on 2009-03-30, all triggered by reviews
* Improvements to Makefile and Kconfig
* Simplified definitions of bm_flags' bitnumbers
* Removed debugging aid
Changes since the post on 2009-03-23, from drbd-mainline
* Updated to the final drbd-8.3.1 code
* Optionally run-length encode bitmap transfers
Changes since the post on 2009-03-23, triggered by reviews
* Using the latest proc_create() now
* Moved the allocation of md_io_tmpp to attach/detach out of drbd_md_sync_page_io()
* Removed the mode selection comments for emacs
* Removed DRBD_ratelimit()
cheers,
Phil
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-04-30 11:26 Philipp Reisner
@ 2009-05-01 8:59 ` Andrew Morton
2009-05-01 11:15 ` Lars Marowsky-Bree
2009-05-02 7:33 ` Bart Van Assche
2009-05-03 5:53 ` Neil Brown
1 sibling, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2009-05-01 8:59 UTC (permalink / raw)
To: Philipp Reisner
Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
Sam Ravnborg, Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
Lars Ellenberg
On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> This is a repost of DRBD
How fast is it?
Is it being used anywhere for anything? If so, where and what?
(it would be useful to add such info to the changelog, and to
maintain it)
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-01 8:59 ` Andrew Morton
@ 2009-05-01 11:15 ` Lars Marowsky-Bree
2009-05-01 13:14 ` Dave Jones
2009-05-05 4:05 ` Christian Kujau
2009-05-02 7:33 ` Bart Van Assche
1 sibling, 2 replies; 44+ messages in thread
From: Lars Marowsky-Bree @ 2009-05-01 11:15 UTC (permalink / raw)
To: Andrew Morton, Philipp Reisner
Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
Lars Ellenberg
On 2009-05-01T01:59:02, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>
> > This is a repost of DRBD
> How fast is it?
From experience, it achieves performance of approx. 98% of wire or
spindle speed, so it is considered rather efficient code.
> Is it being used anywhere for anything? If so, where and what?
It is used by many customers (thousands world-wide, I'm sure) to
replicate block device data locally (to replace more expensive SANs
while achieving higher availability) or async/remotely (for disaster
recovery).
The code is rather stable; the first drbd deployments date back many
years - drbd0.7 for example has been shipping with SLES10/9, and 0.6
with SLES8 already. The new drbd8 code is shipping on SLE11 and is
also used in combination with OCFS2.
So we very much welcome the renewed and persistent interest in merging
the code into mainline (once all serious issues are addressed).
Even if in the long-term a merge with other raid implementations is
pursued (which I'd welcome even more), the existence of so many
deployments means we'll need the code for a while still.
Regards,
Lars
--
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-01 11:15 ` Lars Marowsky-Bree
@ 2009-05-01 13:14 ` Dave Jones
2009-05-01 19:14 ` Andrew Morton
2009-05-05 4:05 ` Christian Kujau
1 sibling, 1 reply; 44+ messages in thread
From: Dave Jones @ 2009-05-01 13:14 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: Andrew Morton, Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
Neil Brown, James Bottomley, Sam Ravnborg, Nikanth Karthikesan,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
Lars Ellenberg
On Fri, May 01, 2009 at 01:15:54PM +0200, Lars Marowsky-Bree wrote:
> > Is it being used anywhere for anything? If so, where and what?
>
> It is used by many customers (thousands world-wide, I'm sure) to
> replicate block device data locally (to replace more expensive SANs
> while achieving higher availablity) or async/remotely (for disaster
> recovery).
>
> The code is rather stable, the first drbd deployments date back many
> years - drbd0.7 for example has been shipping with SLES10/9, and 0.6
> with SLES8 already. The new drbd8 code is shipping on SLE11 and used
> also in combination with OCFS2.
>
> So we very much welcome the renewed and persistent interest of merging
> the code in mainline (once all serious issues are addressed).
>
> Even if in the long-term a merge with other raid implementations is
> pursued (which I'd welcome even more), the existence of so many
> deployments means we'll need the code for awhile still.
I've not looked through the patchset, and it's a bit outside my
domain of expertise, but I can attest that we have had requests to
merge it in Fedora (to which we've given the usual "get it upstream" response).
The folks who run the Fedora infrastructure have been enthusiastic
about it for a while (which is why I ended up on the CC for this thread I guess).
I don't have details about their exact use-cases, but if desired, I can
find out more.
Dave
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-01 13:14 ` Dave Jones
@ 2009-05-01 19:14 ` Andrew Morton
0 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2009-05-01 19:14 UTC (permalink / raw)
To: Dave Jones
Cc: lmb, philipp.reisner, linux-kernel, jens.axboe, gregkh, neilb,
James.Bottomley, sam, knikanth, nab, kyle, bart.vanassche,
lars.ellenberg
On Fri, 1 May 2009 09:14:25 -0400
Dave Jones <davej@redhat.com> wrote:
> On Fri, May 01, 2009 at 01:15:54PM +0200, Lars Marowsky-Bree wrote:
>
> > > Is it being used anywhere for anything? If so, where and what?
> >
> > It is used by many customers (thousands world-wide, I'm sure) to
> > replicate block device data locally (to replace more expensive SANs
> > while achieving higher availablity) or async/remotely (for disaster
> > recovery).
> >
> > The code is rather stable, the first drbd deployments date back many
> > years - drbd0.7 for example has been shipping with SLES10/9, and 0.6
> > with SLES8 already. The new drbd8 code is shipping on SLE11 and used
> > also in combination with OCFS2.
> >
> > So we very much welcome the renewed and persistent interest of merging
> > the code in mainline (once all serious issues are addressed).
> >
> > Even if in the long-term a merge with other raid implementations is
> > pursued (which I'd welcome even more), the existence of so many
> > deployments means we'll need the code for awhile still.
>
> I've not looked through the patchset, and it's a bit outside my
> domain of expertise, but I can attest we have had requests to
> merge it in Fedora (which we've given the usual "get it upstream" response to).
> The folks who run the Fedora infrastructure have been enthusiastic
> about it for a while (which is why I ended up on the CC for this thread I guess).
> I don't have details about their exact use-cases, but if desired, I can
> find out more.
>
Oh. Thanks. Well we should all get cracking on it then.
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-01 8:59 ` Andrew Morton
2009-05-01 11:15 ` Lars Marowsky-Bree
@ 2009-05-02 7:33 ` Bart Van Assche
2009-05-03 5:36 ` Willy Tarreau
1 sibling, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2009-05-02 7:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH, Neil Brown,
James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
Lars Marowsky-Bree, Kyle Moffett, Lars Ellenberg
On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>
>> This is a repost of DRBD
>
> Is it being used anywhere for anything? If so, where and what?
One popular application is to run iSCSI and HA software on top of DRBD
in order to build a highly available iSCSI storage target.
Bart.
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-02 7:33 ` Bart Van Assche
@ 2009-05-03 5:36 ` Willy Tarreau
2009-05-03 5:40 ` david
2009-05-03 10:06 ` Philipp Reisner
0 siblings, 2 replies; 44+ messages in thread
From: Willy Tarreau @ 2009-05-03 5:36 UTC (permalink / raw)
To: Bart Van Assche
Cc: Andrew Morton, Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
Neil Brown, James Bottomley, Sam Ravnborg, Dave Jones,
Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >
> >> This is a repost of DRBD
> >
> > Is it being used anywhere for anything? If so, where and what?
>
> One popular application is to run iSCSI and HA software on top of DRBD
> in order to build a highly available iSCSI storage target.
Confirmed, I have several customers who're doing exactly that.
Willy
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 5:36 ` Willy Tarreau
@ 2009-05-03 5:40 ` david
2009-05-03 14:21 ` James Bottomley
2009-05-03 10:06 ` Philipp Reisner
1 sibling, 1 reply; 44+ messages in thread
From: david @ 2009-05-03 5:40 UTC (permalink / raw)
To: Willy Tarreau
Cc: Bart Van Assche, Andrew Morton, Philipp Reisner, linux-kernel,
Jens Axboe, Greg KH, Neil Brown, James Bottomley, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 3 May 2009, Willy Tarreau wrote:
> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>
>>>> This is a repost of DRBD
>>>
>>> Is it being used anywhere for anything? If so, where and what?
>>
>> One popular application is to run iSCSI and HA software on top of DRBD
>> in order to build a highly available iSCSI storage target.
>
> Confirmed, I have several customers who're doing exactly that.
I will also say that there are a lot of us out here who would have a use
for DRBD in our HA setups, but have held off implementing it specifically
because it's not yet in the upstream kernel.
David Lang
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-04-30 11:26 Philipp Reisner
2009-05-01 8:59 ` Andrew Morton
@ 2009-05-03 5:53 ` Neil Brown
2009-05-03 6:24 ` david
2009-05-03 8:29 ` Lars Ellenberg
1 sibling, 2 replies; 44+ messages in thread
From: Neil Brown @ 2009-05-03 5:53 UTC (permalink / raw)
To: Philipp Reisner
Cc: linux-kernel, Jens Axboe, Greg KH, James Bottomley, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
Lars Ellenberg
On Thursday April 30, philipp.reisner@linbit.com wrote:
> Hi,
>
> This is a repost of DRBD, to keep you updated about the ongoing
> cleanups and improvements.
>
> Patch set attached. Git tree available:
> git pull git://git.drbd.org/linux-2.6-drbd.git drbd
>
> We are looking for reviews!
>
> Description
>
> DRBD is a shared-nothing, synchronously replicated block device. It
> is designed to serve as a building block for high availability
> clusters and in this context, is a "drop-in" replacement for shared
> storage. Simplistically, you could see it as a network RAID 1.
I know this is minor, but it bugs me every time I see that phrase
"shared-nothing". Surely the network is shared?? And the code...
Can you just say "DRBD is a synchronously replicated block device"?
or would we have to call it SRBD then?
Or maybe "shared-nothing" is an accepted technical term in the
clustering world??
>
> Although I use the "RAID1+NBD" metaphor myself, recent discussion
> unveiled that one needs to understand the differences as well.
> Here are just two examples of that:
All this should probably be in a patch against Documentation/drbd.txt
>
> 1) Think of a two node HA cluster. Node A is active ('primary' in DRBD
> speak) has the filesystem mounted and the application running. Node B is
> in standby mode ('secondary' in DRBD speak).
Is there some strong technical reason to only allow 2 nodes? Was it
Asimov who said the only sensible numbers were 0, 1, and infinity?
(People still get surprised that md/raid1 can do 2 or 3 or n drives,
and that md/raid5 can handle just 2 :-)
>
> We loose network connectivity, the primary node continues to run, the
lose
> secondary no longer gets updates.
>
> Then we have a complete power failure, both nodes are down. Then they
> power up the data center again, but at first the get only the power
they
> circuit of node B up and running again.
>
> Should node B offer the service right now ?
> ( DRBD has configurable policies for that )
>
> Later on they manage to get node A up and running again, now lets assume
> node B was chosen to be the new primary node. What needs to be done ?
>
> Modifications on B since it became primary needs to be resynced to A.
> Modifications on A sind it lost contact to B needs to be taken out.
>
> DRBD does that.
>
> How do you fit that into a RAID1+NBD model ? NBD is just a block
> transport, it does not offer the ability to exchange dirty bitmaps or
> data generation identifiers, nor does the RAID1 code has a concept of
> that.
Not 100% true, but I - at least partly - get your point.
As md stores bitmaps and data generation identifiers on the block
device, these can be transferred over NBD just like any other data on
the block device.
However I think that part of your point is that DRBD can transfer them
more efficiently (e.g. it compresses the bitmap before transferring it
- I assume the compression you use is much more effective than gzip??
else why bother to code your own).
I suspect there is more to your point that I am missing.
You say "nor does the RAID1 code has a concept of that". It isn't
clear what you are referring to. RAID1 does have a concept of dirty
bitmaps as you know, and it does have a concept of data generation,
though it is quite possibly weaker than the concept that DRBD has.
I'd need to explore the DRBD code more to be sure.
>
> 2) When using DRBD over small bandwidth links, one has to run a resync,
> DRBD offers the option to do a "checksum based resync". Similar to rsync
> it at first only exchanges a checksum, and transmits the whole data
> block only if the checksums differ.
>
> That again is something that does not fit into the concepts of
> NBD or RAID1.
Interesting idea.... RAID1 does have a mode where it reads both (all)
devices and compares them to see if they match or not. Doing this
compare with checksums rather than memcmp would not be an enormous
change.
I'm beginning to imagine an enhanced NBD as a model for what DRBD
does.
This enhanced NBD not only supports read and write of blocks but also:
- maintains the local bitmap and sets bits before allowing a write
- can return a strong checksum rather than the data of a block
- provides sequence numbers in a way that I don't fully understand
yet, but which allows consistent write ordering.
- allows reads to be compressed so that the bitmap can be
transferred efficiently.
I can imagine that md/raid1 could be made to work well with an
enhanced NBD like this.
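The enhanced-NBD command set imagined in the list above could look
roughly like this (hypothetical message names and semantics, written
down only to make the list concrete; this is neither NBD's nor DRBD's
actual protocol):

```python
# Toy "enhanced NBD" server: dirty bitmap maintained server-side, strong
# checksums on request, and a sequence number tagging each applied write.
from enum import Enum, auto
import hashlib

class Cmd(Enum):
    READ = auto()
    WRITE = auto()      # server sets the dirty bit before applying the write
    READ_CSUM = auto()  # return a strong checksum instead of the block data

class EnhancedNbdServer:
    def __init__(self, nblocks, block=b"\0" * 4096):
        self.blocks = [block] * nblocks
        self.dirty = set()   # the "local bitmap" kept on the server side
        self.seq = 0         # sequence number for consistent write ordering

    def handle(self, cmd, n, data=None):
        if cmd is Cmd.WRITE:
            self.dirty.add(n)          # bitmap bit set before allowing a write
            self.blocks[n] = data
            self.seq += 1
            return self.seq            # lets the client reason about ordering
        if cmd is Cmd.READ:
            return self.blocks[n]
        if cmd is Cmd.READ_CSUM:       # cheap compare for checksum-based resync
            return hashlib.sha256(self.blocks[n]).digest()
```

The bitmap-transfer compression from the list would sit on top of this,
as an encoding of the `dirty` set rather than a new command.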
>
> DRBD can also be used in dual-Primary mode (device writable on both
> nodes), which means it can exhibit shared disk semantics in a
> shared-nothing cluster. Needless to say, on top of dual-Primary
> DRBD utilizing a cluster file system is necessary to maintain for
> cache coherency.
>
> More background on this can be found in this paper:
> http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
>
> Beyond that, DRBD addresses various issues of cluster partitioning,
> which the MD/NBD stack, to the best of our knowledge, does not
> solve. The above-mentioned paper goes into some detail about that as
> well.
Agreed - MD/NBD could probably be easily confused by cluster
partitioning, though I suspect that in many simple cases it would get
it right. I haven't given it enough thought to be sure. I doubt the
enhancements necessary would be very significant though.
>
> DRBD can operate in synchronous mode, or in asynchronous mode. I want
> to point out that we guarantee not to violate a single possible write
> after write dependency when writing on the standby node. More on that
> can be found in this paper:
> http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf
I really must read and understand this paper..
So... what would you think of working towards incorporating all of the
DRBD functionality into md/raid1??
I suspect that it would be a mutually beneficial exercise, except for
the small fact that it would take a significant amount of time and
effort. I'd be willing to shuffle some priorities and put in some effort
if it was a direction that you would be open to exploring.
Whether the current DRBD code gets merged or not is possibly a
separate question, though I would hope that if we followed the path of
merging DRBD into md/raid1, then any duplicate code would eventually be
excised from the kernel.
What do you think?
NeilBrown
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 5:53 ` Neil Brown
@ 2009-05-03 6:24 ` david
2009-05-03 8:29 ` Lars Ellenberg
1 sibling, 0 replies; 44+ messages in thread
From: david @ 2009-05-03 6:24 UTC (permalink / raw)
To: Neil Brown
Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
Bart Van Assche, Lars Ellenberg
I am not a DRBD developer, but I can answer some of your questions below.
On Sun, 3 May 2009, Neil Brown wrote:
> On Thursday April 30, philipp.reisner@linbit.com wrote:
>> Hi,
>>
>> This is a repost of DRBD, to keep you updated about the ongoing
>> cleanups and improvements.
>>
>> Patch set attached. Git tree available:
>> git pull git://git.drbd.org/linux-2.6-drbd.git drbd
>>
>> We are looking for reviews!
>>
>> Description
>>
>> DRBD is a shared-nothing, synchronously replicated block device. It
>> is designed to serve as a building block for high availability
>> clusters and in this context, is a "drop-in" replacement for shared
>> storage. Simplistically, you could see it as a network RAID 1.
>
> I know this is minor, but it bugs me every time I see that phrase
> "shared-nothing". Surely the network is shared??
the logical network(s) as a whole are shared, but physically they can be
redundant, multi-pathed, etc.
> And the code...
> Can you just say "DRBD is a synchronously replicated block device"?
> or would we have to call it SRBD then?
> Or maybe "shared-nothing" is an accepted technical term in the
> clustering world??
DRBD can be configured to be synchronous or asynchronous.
'shared-nothing' is an accepted technical term in the clustering world for
when two systems are not using any single device.
in the case of a network, I commonly set up systems where the network has
two switches (connected together with fiber so that an electrical problem
in one switch cannot short out the other) with the primary box plugged
into one switch and the backup box plugged into the other. I also make sure
that my primary and backup systems are in separate racks, so that
if something goes wrong in one rack that causes an excessive amount of
heat it won't affect the backup systems (and yes, this has happened to me
when I got lazy and stopped checking on this)
at this point the network switch is not shared (although the logical
network is)
in the case of disk storage the common situation is 'shared-disk' where
you have one disk array and both machines are plugged into it.
this gives you a single point of failure if the disk array crashes (even
if it has redundant controllers, power supplies, etc., things still happen),
and the disk array can only be in one physical location.
DRBD lets you logically set up your systems as if they were a 'shared-disk'
architecture, but with the hardware being 'shared-nothing'
you can have the two halves of the cluster in different states, so that
even a major disaster like an earthquake won't kill the system. (a classic
case of 'shared-nothing')
>>
>> 1) Think of a two node HA cluster. Node A is active ('primary' in DRBD
>> speak) has the filesystem mounted and the application running. Node B is
>> in standby mode ('secondary' in DRBD speak).
>
> If there some strong technical reason to only allow 2 nodes? Was it
> Asimov who said the only sensible numbers were 0, 1, and infinity?
> (People still get surprised that md/raid1 can do 2 or 3 or n drives,
> and that md/raid5 can handle just 2 :-)
in this case we have 1 replica (or '1 other machine'), so we are on an
'interesting number' ;-)
many people would love to see DRBD extended beyond this, but my
understanding is that doing so is non-trivial.
>> DRBD can also be used in dual-Primary mode (device writable on both
>> nodes), which means it can exhibit shared disk semantics in a
>> shared-nothing cluster. Needless to say, on top of dual-Primary
>> DRBD utilizing a cluster file system is necessary to maintain for
>> cache coherency.
>>
>> More background on this can be found in this paper:
>> http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
>>
>> Beyond that, DRBD addresses various issues of cluster partitioning,
>> which the MD/NBD stack, to the best of our knowledge, does not
>> solve. The above-mentioned paper goes into some detail about that as
>> well.
>
> Agreed - MD/NBD could probably be easily confused by cluster
> partitioning, though I suspect that in many simple cases it would get
> it right. I haven't given it enough thought to be sure. I doubt the
> enhancements necessary would be very significant though.
think of two different threads doing writes directly to their own side of
the mirror; the system needs to notice this happening and copy the data to
the other half of the mirror (with GFS working above you to coordinate the
two threads and make sure they don't make conflicting writes)
it's not a trivial task.
David Lang
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 5:53 ` Neil Brown
2009-05-03 6:24 ` david
@ 2009-05-03 8:29 ` Lars Ellenberg
2009-05-03 11:00 ` Neil Brown
1 sibling, 1 reply; 44+ messages in thread
From: Lars Ellenberg @ 2009-05-03 8:29 UTC (permalink / raw)
To: Neil Brown
Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
Bart Van Assche
On Sun, May 03, 2009 at 03:53:41PM +1000, Neil Brown wrote:
> I know this is minor, but it bugs me every time I see that phrase
> "shared-nothing".
> Or maybe "shared-nothing" is an accepted technical term in the
> clustering world??
yes.
> All this should probably be in a patch against Documentation/drbd.txt
Ok.
> > 1) Think of a two node HA cluster. Node A is active ('primary' in DRBD
> > speak) has the filesystem mounted and the application running. Node B is
> > in standby mode ('secondary' in DRBD speak).
>
> If there some strong technical reason to only allow 2 nodes?
It "just" has not yet been implemented.
I'm working on that, though.
> > How do you fit that into a RAID1+NBD model ? NBD is just a block
> > transport, it does not offer the ability to exchange dirty bitmaps or
> > data generation identifiers, nor does the RAID1 code has a concept of
> > that.
>
> Not 100% true, but I - at least partly - get your point.
> As md stores bitmaps and data generation identifiers on the block
> device, these can be transferred over NBD just like any other data on
> the block device.
Do you have one dirty bitmap per mirror (yet) ?
Do you _merge_ them?
the "NBD" mirrors are remote, and once you lose communication,
they may be (and in general, you have to assume they are) modified
by which ever node they are directly attached to.
> However I think that part of your point is that DRBD can transfer them
> more efficiently (e.g. it compresses the bitmap before transferring it
> - I assume the compression you use is much more effective than gzip??
> else why both to code your own).
No, the point was that we have one bitmap per mirror (though currently
number of mirrors == 2, only), and that we do merge them.
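The merge itself is the simple part and can be sketched in one line:
after reconnect, either copy may have changed, so the resync set is the
bitwise OR of both bitmaps (illustrative only):

```python
# One dirty bitmap per mirror; on reconnect, OR them together, since
# blocks dirtied on either side must be part of the resync.
def merge_bitmaps(local, peer):
    """local/peer: bytes objects of equal length, one bit per block."""
    return bytes(a | b for a, b in zip(local, peer))
```

The hard part is everything around it: knowing that a remote mirror may
have been written independently, and deciding which side's data wins.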
but to answer the question:
why bother to implement our own encoding?
because we know a lot about the data to be encoded.
the compression of the bitmap transfer we just added very recently.
for a bitmap, with large chunks of bits set or unset, it is efficient
to just code the runlength.
to use gzip in kernel would add yet another huge overhead for code
tables and so on.
during testing of this encoding, applying it to an already gzip'ed file
was able to compress it even further, btw.
though on English plain text, gzip compression is _much_ more effective.
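The run-length idea reads roughly like this (an illustrative encoding,
not DRBD's actual on-the-wire format): a bitmap with long runs of set or
unset bits is reduced to its first bit value plus the run lengths.

```python
# Run-length code for a bitmap: store the first bit and the run lengths;
# runs alternate 0/1, so no per-run value needs to be stored.
def rle_encode(bits):
    """bits: sequence of 0/1 ints. Returns (first bit, list of run lengths)."""
    if not bits:
        return 0, []
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return bits[0], runs

def rle_decode(first, runs):
    bits, val = [], first
    for run in runs:
        bits.extend([val] * run)
        val ^= 1               # runs alternate between 0 and 1
    return bits
```

A mostly-clean bitmap (a few dirty extents in a sea of zeros) collapses
to a handful of integers, which is the case that matters for a resync
handshake.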
> You say "nor does the RAID1 code has a concept of that". It isn't
> clear what you are referring to.
The concept that one of the mirrors (the "nbd" one in that picture)
may have been accessed independently, without MD knowing,
because the node this MD (and its "local" mirror) was living on
suffered from power outage.
The concept of both mirrors being modified _simultaneously_
(e.g. living below a cluster file system).
> > 2) When using DRBD over small bandwidth links, one has to run a resync,
> > DRBD offers the option to do a "checksum based resync". Similar to rsync
> > it at first only exchanges a checksum, and transmits the whole data
> > block only if the checksums differ.
> >
> > That again is something that does not fit into the concepts of
> > NBD or RAID1.
>
> Interesting idea.... RAID1 does have a mode where it reads both (all)
> devices and compares them to see if they match or not. Doing this
> compare with checksums rather than memcmp would not be an enormous
> change.
>
> I'm beginning to imagine an enhanced NBD as a model for what DRBD
> does. This enhanced NBD not only supports read and write of blocks
> but also:
>
> - maintains the local bitmap and sets bits before allowing a write
right.
> - can return a strong checksum rather than the data of a block
ok.
> - provides sequence numbers in a way that I don't fully understand
> yet, but which allows consistent write ordering.
yes, please.
> - allows reads to be compressed so that the bitmap can be
> transferred efficiently.
yep.
add to that
- can exchange data generations on handshake,
- can refuse the handshake (consistent data,
but evolved differently than the other copy;
diverging data sets detected!)
- is bi-directional, can _push_ writes!
and whatever else I forgot just now.
> I can imagine that md/raid1 could be made to work well with an
> enhanced NBD like this.
of course.
> > DRBD can also be used in dual-Primary mode (device writable on both
> > nodes), which means it can exhibit shared disk semantics in a
> > shared-nothing cluster. Needless to say, on top of dual-Primary
> > DRBD utilizing a cluster file system is necessary to maintain for
> > cache coherency.
> >
> > More background on this can be found in this paper:
> > http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
> >
> > Beyond that, DRBD addresses various issues of cluster partitioning,
> > which the MD/NBD stack, to the best of our knowledge, does not
> > solve. The above-mentioned paper goes into some detail about that as
> > well.
>
> Agreed - MD/NBD could probably be easily confused by cluster
> partitioning, though I suspect that in many simple cases it would get
> it right. I haven't given it enough thought to be sure. I doubt the
> enhancements necessary would be very significant though.
The most significant part is probably the bidirectional nature
and the "refuse it" part of the handshake.
> > DRBD can operate in synchronous mode, or in asynchronous mode. I want
> > to point out that we guarantee not to violate a single possible write
> > after write dependency when writing on the standby node. More on that
> > can be found in this paper:
> > http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf
>
> I really must read and understand this paper..
>
>
> So... what would you think of working towards incorporating all of the
> DRBD functionality into md/raid1??
> I suspect that it would be a mutually beneficial exercise, except for
> the small fact that it would take a significant amount of time and
> effort. I'd be willing to shuffle some priorities and put in some effort
> if it was a direction that you would be open to exploring.
Sure. But yes, full ack on the time and effort part ;)
> Whether the current DRBD code gets merged or not is possibly a
> separate question, though I would hope that if we followed the path of
> merging DRBD into md/raid1, then any duplicate code would eventually be
> excised from the kernel.
Rumor [http://lwn.net/Articles/326818/] has it that the various in-kernel
RAID implementations are being unified right now, anyway?
If you want to stick to "replication is almost identical to RAID1",
best not to forget "this may be a remote mirror", there may be more than
one entity accessing it, this may be part of a bi-directional
(active-active) replication setup.
For further ideas on what could be done with replication (enhancing the
strict "raid1" notion), see also
http://www.drbd.org/fileadmin/drbd/publications/drbd9.linux-kongress.2008.pdf
- time shift replication
- generic point in time recovery of block device data
- (remote) backup by periodic, round-robin re-sync of
"raid" members, then "dropping" them again.
...
No usable code for those ideas yet,
but a lot of thought. It is not all handwaving.
Lars
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 5:36 ` Willy Tarreau
2009-05-03 5:40 ` david
@ 2009-05-03 10:06 ` Philipp Reisner
2009-05-03 10:15 ` Thomas Backlund
1 sibling, 1 reply; 44+ messages in thread
From: Philipp Reisner @ 2009-05-03 10:06 UTC (permalink / raw)
To: Willy Tarreau
Cc: Bart Van Assche, Andrew Morton, linux-kernel, Jens Axboe, Greg KH,
Neil Brown, James Bottomley, Sam Ravnborg, Dave Jones,
Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sunday, 3 May 2009, 07:36:00, Willy Tarreau wrote:
> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >
> > <akpm@linux-foundation.org> wrote:
> > > On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner
<philipp.reisner@linbit.com> wrote:
> > >> This is a repost of DRBD
> > >
> > > Is it being used anywhere for anything? If so, where and what?
> >
> > One popular application is to run iSCSI and HA software on top of DRBD
> > in order to build a highly available iSCSI storage target.
>
> Confirmed, I have several customers who're doing exactly that.
>
Besides storage targets, DRBD is also very popular for databases; it is
widely used with PostgreSQL and MySQL. Both database projects advertise
running their DB on top of DRBD to form HA clusters.
Raw numbers of installations:
We have an opt-in global usage counter. See http://www.drbd.org/usage/year/
If we assume that 30% of all users agree to have their DRBD installation
counted, then we had more than 12,000 new installations in April
(4,245 installations were counted; 4,245 / 0.30 ≈ 14,000).
It seems that nowadays most of our users get DRBD through their distributions.
Distributions that include DRBD are (list incomplete):
Debian, Ubuntu, SLES, CentOS
-Phil
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 10:06 ` Philipp Reisner
@ 2009-05-03 10:15 ` Thomas Backlund
0 siblings, 0 replies; 44+ messages in thread
From: Thomas Backlund @ 2009-05-03 10:15 UTC (permalink / raw)
To: Linux Kernel Mailing List
Philipp Reisner wrote:
>
> It seems that nowadays most of our users get DRBD through their distributions.
> Distributions that include DRBD are (list incomplete):
> Debian, Ubuntu, SLES, CentOS
Mandriva has been including it for ~4 years too...
--
Thomas
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 8:29 ` Lars Ellenberg
@ 2009-05-03 11:00 ` Neil Brown
2009-05-03 21:32 ` Lars Ellenberg
0 siblings, 1 reply; 44+ messages in thread
From: Neil Brown @ 2009-05-03 11:00 UTC (permalink / raw)
To: Lars Ellenberg
Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
Bart Van Assche
On Sunday May 3, lars.ellenberg@linbit.com wrote:
> > If there some strong technical reason to only allow 2 nodes?
>
> It "just" has not yet been implemented.
> I'm working on that, though.
:-)
>
> > > How do you fit that into a RAID1+NBD model ? NBD is just a block
> > > transport, it does not offer the ability to exchange dirty bitmaps or
> > > data generation identifiers, nor does the RAID1 code have a concept of
> > > that.
> >
> > Not 100% true, but I - at least partly - get your point.
> > As md stores bitmaps and data generation identifiers on the block
> > device, these can be transferred over NBD just like any other data on
> > the block device.
>
> Do you have one dirty bitmap per mirror (yet) ?
> Do you _merge_ them?
md doesn't merge bitmaps yet. However, if I found a need to, I would
simply read a bitmap in userspace and feed it into the kernel via
/sys/block/mdX/md/bitmap_set_bits
We sort-of have one bitmap per mirror, but only because the one bitmap
is mirrored...
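Merging dirty bitmaps in user space, as suggested above, is essentially a bitwise OR: a block needs resync if either side marked it dirty. A minimal sketch (names illustrative):

```python
def merge_bitmaps(a: bytes, b: bytes) -> bytes:
    """Merge two dirty bitmaps covering the same device: a bit is set
    in the result if it is set in either input."""
    assert len(a) == len(b), "bitmaps must cover the same device"
    return bytes(x | y for x, y in zip(a, b))
```

The merged result could then be fed back into the kernel through the sysfs interface Neil mentions, leaving only the per-bit OR itself outside the kernel.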
>
> the "NBD" mirrors are remote, and once you lose communication,
> they may be (and in general, you have to assume they are) modified
> by whichever node they are directly attached to.
>
> > However I think that part of your point is that DRBD can transfer them
> > more efficiently (e.g. it compresses the bitmap before transferring it
> > - I assume the compression you use is much more effective than gzip??
> > else why bother to code your own).
>
> No, the point was that we have one bitmap per mirror (though currently
> number of mirrors == 2, only), and that we do merge them.
Right. I imagine much of the complexity of that could be handled in
user-space while setting up a DRBD instance (??).
>
> but to answer the question:
> why bother to implement our own encoding?
> because we know a lot about the data to be encoded.
>
> the compression of the bitmap transfer we just added very recently.
> for a bitmap, with large chunks of bits set or unset, it is efficient
> to just code the runlength.
> to use gzip in kernel would add yet another huge overhead for code
> tables and so on.
> during testing of this encoding, applying it to an already gzip'ed file
> was able to compress it even further, btw.
> though on english plain text, gzip compression is _much_ more effective.
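The run-length idea quoted above can be sketched in a few lines (illustrative only; this is not DRBD's actual wire encoding):

```python
def rle_encode(bits):
    """Encode a sequence of 0/1 bits as (first_bit, run_lengths).
    A sparse or mostly-set bitmap collapses to a handful of runs."""
    bits = list(bits)
    if not bits:
        return (0, [])
    runs, cur, n = [], bits[0], 0
    for b in bits:
        if b == cur:
            n += 1
        else:
            runs.append(n)
            cur, n = b, 1
    runs.append(n)
    return (bits[0], runs)
```

Since only the run lengths are coded, there are no code tables to ship, which is the point Lars makes against in-kernel gzip.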
I just tried a little experiment.
I created a 128meg file and randomly set 1000 bits in it.
I compressed it with "gzip --best" and the result was 4Meg. Not
particularly impressive.
I then tried to compress it with bzip2 and got 3452 bytes.
Now *that* is impressive. I suspect your encoding might do a little
better, but I wonder if it is worth the effort.
I'm not certain that my test file is entirely realistic, but it is
still an interesting experiment.
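Neil's experiment is easy to reproduce; here is a scaled-down sketch (16 MiB instead of 128 MB, using Python's zlib/bz2 modules in place of the gzip/bzip2 tools, so the exact sizes will differ):

```python
import bz2
import random
import zlib

random.seed(0)
size = 16 * 2**20            # scaled down from the 128 MB file
buf = bytearray(size)
for _ in range(1000):        # randomly set 1000 bits
    i = random.randrange(size * 8)
    buf[i // 8] |= 1 << (i % 8)

gz = zlib.compress(bytes(buf), 9)   # roughly "gzip --best"
bz = bz2.compress(bytes(buf))
print("deflate:", len(gz), "bytes   bzip2:", len(bz), "bytes")
```

The gap comes from deflate's 32 KB window and 258-byte match limit: long zero runs still cost one token per ~258 bytes, while bzip2's block transform collapses them almost entirely.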
Why do you do this compression in the kernel? It seems to me that it
would be quite practical to do it all in user-space, thus making it
really easy to use pre-existing libraries.
BTW, the kernel already contains various compression code as part of
the crypto API.
>
> > You say "nor does the RAID1 code has a concept of that". It isn't
> > clear what you are referring to.
>
> The concept that one of the mirrors (the "nbd" one in that picture)
> may have been accessed independently, without MD knowning,
> because the node this MD (and its "local" mirror) was living on
> suffered from power outage.
>
> The concept of both mirrors being modified _simultaneously_,
> (e.g. living below a cluster file system).
Yes, that is an important concept. Certainly one of the bits that
would need to be added to md.
> > Whether the current DRBD code gets merged or not is possibly a
> > separate question, though I would hope that if we followed the path of
> > merging DRBD into md/raid1, then any duplicate code would eventually be
> > excised from the kernel.
>
> Rumor [http://lwn.net/Articles/326818/] has it, that the various in
> kernel raid implementations are being unified right now, anyways?
I'm not holding my breath on that one...
I think that merging DRBD with md/raid1 would be significantly easier
than any sort of merge between md and dm. But (in either case) I'll
do what I can to assist any effort that is technically sound.
>
> If you want to stick to "replication is almost identical to RAID1",
> best not to forget "this may be a remote mirror", there may be more than
> one entity accessing it, this may be part of a bi-directional
> (active-active) replication setup.
>
> For further ideas on what could be done with replication (enhancing the
> strict "raid1" notion), see also
> http://www.drbd.org/fileadmin/drbd/publications/drbd9.linux-kongress.2008.pdf
>
> - time shift replication
> - generic point in time recovery of block device data
> - (remote) backup by periodically, round-robin re-sync of
> "raid" members, then "dropping" them again.
> ...
>
> No useable code on those ideas, yet,
> but a lot of thought. It is not all handwaving.
:-)
I'll have to do a bit of reading I see. I'll then try to rough out a
design and plan for merging DRBD functionality with md/raid1. At the
very least that would give me enough background understanding to be
able to sensibly review your code submission.
NeilBrown
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 5:40 ` david
@ 2009-05-03 14:21 ` James Bottomley
2009-05-03 14:36 ` david
0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2009-05-03 14:21 UTC (permalink / raw)
To: david
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, Willy Tarreau wrote:
>
> > On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >> <akpm@linux-foundation.org> wrote:
> >>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>
> >>>> This is a repost of DRBD
> >>>
> >>> Is it being used anywhere for anything? If so, where and what?
> >>
> >> One popular application is to run iSCSI and HA software on top of DRBD
> >> in order to build a highly available iSCSI storage target.
> >
> > Confirmed, I have several customers who're doing exactly that.
>
> I will also say that there are a lot of us out here who would have a use
> for DRDB in our HA setups, but have held off implementing it specificly
> because it's not yet in the upstream kernel.
Actually, that's not a particularly strong reason because we already
have an in-kernel replicator that has much of the functionality of drbd
that you could use. The main reason for wanting drbd in kernel is that
it has a *current* user base.
Both the in kernel md/nbd and drbd do sync and async replication with
primary side bitmaps. The main differences are:
* md/nbd can do 1 to N replication,
* drbd can do active/active replication (useful for cluster
filesystems)
* The chunk size of the md/nbd is tunable
* With the updated nbd-tools, current md/nbd can do point in time
rollback on transaction logged secondaries (a BCS requirement)
* drbd manages the mirror state explicitly, md/nbd needs a user
space helper
And probably a few others I forget.
James
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 14:21 ` James Bottomley
@ 2009-05-03 14:36 ` david
2009-05-03 14:45 ` James Bottomley
0 siblings, 1 reply; 44+ messages in thread
From: david @ 2009-05-03 14:36 UTC (permalink / raw)
To: James Bottomley
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 3 May 2009, James Bottomley wrote:
> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>
> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
>> On Sun, 3 May 2009, Willy Tarreau wrote:
>>
>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>>>> <akpm@linux-foundation.org> wrote:
>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>>>
>>>>>> This is a repost of DRBD
>>>>>
>>>>> Is it being used anywhere for anything? If so, where and what?
>>>>
>>>> One popular application is to run iSCSI and HA software on top of DRBD
>>>> in order to build a highly available iSCSI storage target.
>>>
>>> Confirmed, I have several customers who're doing exactly that.
>>
>> I will also say that there are a lot of us out here who would have a use
>> for DRDB in our HA setups, but have held off implementing it specificly
>> because it's not yet in the upstream kernel.
>
> Actually, that's not a particularly strong reason because we already
> have an in-kernel replicator that has much of the functionality of drbd
> that you could use. The main reason for wanting drbd in kernel is that
> it has a *current* user base.
>
> Both the in kernel md/nbd and drbd do sync and async replication with
> primary side bitmaps. The main differences are:
>
> * md/nbd can do 1 to N replication,
> * drbd can do active/active replication (useful for cluster
> filesystems)
> * The chunk size of the md/nbd is tunable
> * With the updated nbd-tools, current md/nbd can do point in time
> rollback on transaction logged secondaries (a BCS requirement)
> * drbd manages the mirror state explicitly, md/nbd needs a user
> space helper
>
> And probably a few others I forget.
one very big one:
DRBD has better support for dealing with split brain situations and
recovering from them.
David Lang
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 14:36 ` david
@ 2009-05-03 14:45 ` James Bottomley
2009-05-03 14:56 ` david
2009-05-04 8:28 ` Philipp Reisner
0 siblings, 2 replies; 44+ messages in thread
From: James Bottomley @ 2009-05-03 14:45 UTC (permalink / raw)
To: david
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, James Bottomley wrote:
>
> > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >
> > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> >> On Sun, 3 May 2009, Willy Tarreau wrote:
> >>
> >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >>>> <akpm@linux-foundation.org> wrote:
> >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>>>
> >>>>>> This is a repost of DRBD
> >>>>>
> >>>>> Is it being used anywhere for anything? If so, where and what?
> >>>>
> >>>> One popular application is to run iSCSI and HA software on top of DRBD
> >>>> in order to build a highly available iSCSI storage target.
> >>>
> >>> Confirmed, I have several customers who're doing exactly that.
> >>
> >> I will also say that there are a lot of us out here who would have a use
> >> for DRDB in our HA setups, but have held off implementing it specificly
> >> because it's not yet in the upstream kernel.
> >
> > Actually, that's not a particularly strong reason because we already
> > have an in-kernel replicator that has much of the functionality of drbd
> > that you could use. The main reason for wanting drbd in kernel is that
> > it has a *current* user base.
> >
> > Both the in kernel md/nbd and drbd do sync and async replication with
> > primary side bitmaps. The main differences are:
> >
> > * md/nbd can do 1 to N replication,
> > * drbd can do active/active replication (useful for cluster
> > filesystems)
> > * The chunk size of the md/nbd is tunable
> > * With the updated nbd-tools, current md/nbd can do point in time
> > rollback on transaction logged secondaries (a BCS requirement)
> > * drbd manages the mirror state explicitly, md/nbd needs a user
> > space helper
> >
> > And probably a few others I forget.
>
> one very big one:
>
> DRDB has better support for dealing with split brain situations and
> recovering from them.
I don't really think so. The decision about which (or if a) node should
be killed lies with the HA harness outside of the province of the
replication.
One could argue that the symmetric active mode of drbd allows both nodes
to continue rather than having the harness make a kill decision about
one. However, if they both alter the same data, you get an
irreconcilable data corruption fault which, one can argue, is directly
counter to HA principles and so allowing drbd continuation is arguably
the wrong thing to do.
James
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 14:45 ` James Bottomley
@ 2009-05-03 14:56 ` david
2009-05-03 15:09 ` James Bottomley
2009-05-04 8:28 ` Philipp Reisner
1 sibling, 1 reply; 44+ messages in thread
From: david @ 2009-05-03 14:56 UTC (permalink / raw)
To: James Bottomley
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 3 May 2009, James Bottomley wrote:
> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>
> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>
>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
>>>>
>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>>>>>> <akpm@linux-foundation.org> wrote:
>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>>>>>
>>>>>>>> This is a repost of DRBD
>>>>>>>
>>>>>>> Is it being used anywhere for anything? If so, where and what?
>>>>>>
>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
>>>>>> in order to build a highly available iSCSI storage target.
>>>>>
>>>>> Confirmed, I have several customers who're doing exactly that.
>>>>
>>>> I will also say that there are a lot of us out here who would have a use
>>>> for DRDB in our HA setups, but have held off implementing it specificly
>>>> because it's not yet in the upstream kernel.
>>>
>>> Actually, that's not a particularly strong reason because we already
>>> have an in-kernel replicator that has much of the functionality of drbd
>>> that you could use. The main reason for wanting drbd in kernel is that
>>> it has a *current* user base.
>>>
>>> Both the in kernel md/nbd and drbd do sync and async replication with
>>> primary side bitmaps. The main differences are:
>>>
>>> * md/nbd can do 1 to N replication,
>>> * drbd can do active/active replication (useful for cluster
>>> filesystems)
>>> * The chunk size of the md/nbd is tunable
>>> * With the updated nbd-tools, current md/nbd can do point in time
>>> rollback on transaction logged secondaries (a BCS requirement)
>>> * drbd manages the mirror state explicitly, md/nbd needs a user
>>> space helper
>>>
>>> And probably a few others I forget.
>>
>> one very big one:
>>
>> DRDB has better support for dealing with split brain situations and
>> recovering from them.
>
> I don't really think so. The decision about which (or if a) node should
> be killed lies with the HA harness outside of the province of the
> replication.
>
> One could argue that the symmetric active mode of drbd allows both nodes
> to continue rather than having the harness make a kill decision about
> one. However, if they both alter the same data, you get an
> irreconcilable data corruption fault which, one can argue, is directly
> counter to HA principles and so allowing drbd continuation is arguably
> the wrong thing to do.
but the issue is that at the time the failure is taking place, neither
side _knows_ that the other side is running. In fact, they both think that
the other side is dead.
with DRBD, when the two sides start talking again they will discover that
they are different and complain, loudly, to the sysadmin that they need
help.
with md/nbd you have the situation where both sides will try to resync to
the other side as soon as the packets can get through. This can end up
corrupting both sides if it's not caught fast enough.
David Lang
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 14:56 ` david
@ 2009-05-03 15:09 ` James Bottomley
2009-05-03 15:22 ` david
0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2009-05-03 15:09 UTC (permalink / raw)
To: david
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 2009-05-03 at 07:56 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, James Bottomley wrote:
>
> > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >
> > On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> >> On Sun, 3 May 2009, James Bottomley wrote:
> >>
> >>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>
> >>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> >>>> On Sun, 3 May 2009, Willy Tarreau wrote:
> >>>>
> >>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >>>>>> <akpm@linux-foundation.org> wrote:
> >>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>>>>>
> >>>>>>>> This is a repost of DRBD
> >>>>>>>
> >>>>>>> Is it being used anywhere for anything? If so, where and what?
> >>>>>>
> >>>>>> One popular application is to run iSCSI and HA software on top of DRBD
> >>>>>> in order to build a highly available iSCSI storage target.
> >>>>>
> >>>>> Confirmed, I have several customers who're doing exactly that.
> >>>>
> >>>> I will also say that there are a lot of us out here who would have a use
> >>>> for DRDB in our HA setups, but have held off implementing it specificly
> >>>> because it's not yet in the upstream kernel.
> >>>
> >>> Actually, that's not a particularly strong reason because we already
> >>> have an in-kernel replicator that has much of the functionality of drbd
> >>> that you could use. The main reason for wanting drbd in kernel is that
> >>> it has a *current* user base.
> >>>
> >>> Both the in kernel md/nbd and drbd do sync and async replication with
> >>> primary side bitmaps. The main differences are:
> >>>
> >>> * md/nbd can do 1 to N replication,
> >>> * drbd can do active/active replication (useful for cluster
> >>> filesystems)
> >>> * The chunk size of the md/nbd is tunable
> >>> * With the updated nbd-tools, current md/nbd can do point in time
> >>> rollback on transaction logged secondaries (a BCS requirement)
> >>> * drbd manages the mirror state explicitly, md/nbd needs a user
> >>> space helper
> >>>
> >>> And probably a few others I forget.
> >>
> >> one very big one:
> >>
> >> DRDB has better support for dealing with split brain situations and
> >> recovering from them.
> >
> > I don't really think so. The decision about which (or if a) node should
> > be killed lies with the HA harness outside of the province of the
> > replication.
> >
> > One could argue that the symmetric active mode of drbd allows both nodes
> > to continue rather than having the harness make a kill decision about
> > one. However, if they both alter the same data, you get an
> > irreconcilable data corruption fault which, one can argue, is directly
> > counter to HA principles and so allowing drbd continuation is arguably
> > the wrong thing to do.
>
> but the issue is that at the time the failure is taking place, neither
> side _knows_ that the other side is running. In fact, they both think that
> the other side is dead.
Resolving this is the job of the HA harness, as I said ... the usual
solution being either third node pings or confirmable switchover.
> with DRDB, when the two sides start talking again they will discover that
> they are different and complain, loudly, to the sysadmin that they need
> help
The object of HA is to prevent data becoming toast, not to point it out
to the sysadmin after the fact.
> with md/ndb you have the situation where both sides will try to resync to
> the other side as soon as the packets can get through. this can end up
> corrupting both sides if it's not caught fast enough
Actually, that's just your implementation: md/nbd does nothing to
re-establish the replication, it has to be done by the HA harness after
split brain resolution. What a correct harness would do is to compare
the HA event log and the intent logs to see if there had been activity
> to both sides after loss of contact and, if there had, to flag the data
corruption problem and not resume replication.
This corruption situation isn't unique to replication ... any time you
may potentially have allowed both sides to write to a data store, you
get it, that's why it's the job of the HA harness to sort out whether a
split brain happened and what to do about it *first*.
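The harness decision James describes — resume replication only if at most one side wrote after losing contact, otherwise flag corruption — boils down to logic like this (a hypothetical sketch; field names are illustrative, derived from the HA event log and the intent logs):

```python
def may_resume_replication(local: dict, remote: dict) -> bool:
    """After a split brain: True if replication may simply resume
    (resync from the side that wrote), False if both sides modified
    data and the corruption must be flagged instead."""
    if local["wrote_after_split"] and remote["wrote_after_split"]:
        return False   # irreconcilable: both sides diverged
    return True        # at most one writer: resync from that side
```

The point of the surrounding argument is that this check belongs in the HA harness, before any replicator (md/nbd or DRBD) is allowed to reconnect.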
James
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 15:09 ` James Bottomley
@ 2009-05-03 15:22 ` david
2009-05-03 15:38 ` James Bottomley
0 siblings, 1 reply; 44+ messages in thread
From: david @ 2009-05-03 15:22 UTC (permalink / raw)
To: James Bottomley
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 3 May 2009, James Bottomley wrote:
>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>
>>> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>
>>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>>>
>>>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
>>>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
>>>>>>
>>>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>>>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>>>>>>>> <akpm@linux-foundation.org> wrote:
>>>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is a repost of DRBD
>>>>>>>>>
>>>>>>>>> Is it being used anywhere for anything? If so, where and what?
>>>>>>>>
>>>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
>>>>>>>> in order to build a highly available iSCSI storage target.
>>>>>>>
>>>>>>> Confirmed, I have several customers who're doing exactly that.
>>>>>>
>>>>>> I will also say that there are a lot of us out here who would have a use
>>>>>> for DRDB in our HA setups, but have held off implementing it specificly
>>>>>> because it's not yet in the upstream kernel.
>>>>>
>>>>> Actually, that's not a particularly strong reason because we already
>>>>> have an in-kernel replicator that has much of the functionality of drbd
>>>>> that you could use. The main reason for wanting drbd in kernel is that
>>>>> it has a *current* user base.
>>>>>
>>>>> Both the in kernel md/nbd and drbd do sync and async replication with
>>>>> primary side bitmaps. The main differences are:
>>>>>
>>>>> * md/nbd can do 1 to N replication,
>>>>> * drbd can do active/active replication (useful for cluster
>>>>> filesystems)
>>>>> * The chunk size of the md/nbd is tunable
>>>>> * With the updated nbd-tools, current md/nbd can do point in time
>>>>> rollback on transaction logged secondaries (a BCS requirement)
>>>>> * drbd manages the mirror state explicitly, md/nbd needs a user
>>>>> space helper
>>>>>
>>>>> And probably a few others I forget.
>>>>
>>>> one very big one:
>>>>
>>>> DRDB has better support for dealing with split brain situations and
>>>> recovering from them.
>>>
>>> I don't really think so. The decision about which (or if a) node should
>>> be killed lies with the HA harness outside of the province of the
>>> replication.
>>>
>>> One could argue that the symmetric active mode of drbd allows both nodes
>>> to continue rather than having the harness make a kill decision about
>>> one. However, if they both alter the same data, you get an
>>> irreconcilable data corruption fault which, one can argue, is directly
>>> counter to HA principles and so allowing drbd continuation is arguably
>>> the wrong thing to do.
>>
>> but the issue is that at the time the failure is taking place, neither
>> side _knows_ that the other side is running. In fact, they both think that
>> the other side is dead.
>
> Resolving this is the job of the HA harness, as I said ... the usual
> solution being either third node pings or confirmable switchover.
and none of those solutions are failsafe in a distributed environment (in
a local environment you can have a race to see which system powers off the
other first to ensure that at most one is running, but you can't do that
reliably remotely)
>> with DRDB, when the two sides start talking again they will discover that
>> they are different and complain, loudly, to the sysadmin that they need
>> help
>
> The object of HA is to prevent data becoming toast, not to point it out
> to the sysadmin after the fact.
it needs to do both
>> with md/ndb you have the situation where both sides will try to resync to
>> the other side as soon as the packets can get through. this can end up
>> corrupting both sides if it's not caught fast enough
>
> Actually, that's just your implementation: md/nbd does nothing to
> re-establish the replication, it has to be done by the HA harness after
> split brain resolution. What a correct harness would do is to compare
> the HA event log and the intent logs to see if there had been activity
> to both sides after loss of contact and, if their had, to flag the data
> corruption problem and not resume replication.
>
> This corruption situation isn't unique to replication ... any time you
> may potentially have allowed both sides to write to a data store, you
> get it, that's why it's the job of the HA harness to sort out whether a
> split brain happened and what to do about it *first*.
but you can have packets sitting in the network buffers waiting to get to
the remote machine; then once the connection is reestablished, those
packets will go out. no remounting needed, just connectivity restored.
(this isn't as bad as if the system tries to re-sync to the temporarily
unavailable drive by itself, but it can still corrupt things)
a cluster spread across different locations has problems to face that a
cluster within easy cabling distance does not.
DRBD has been extensively tested and built to survive in the harsher
environment. md/nbd is a reasonable approximation for the simple
environment of two servers in one datacenter, but that doesn't mean that
it handles the rest of the possible conditions.
David Lang
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 15:22 ` david
@ 2009-05-03 15:38 ` James Bottomley
2009-05-03 15:48 ` david
0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2009-05-03 15:38 UTC (permalink / raw)
To: david
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 2009-05-03 at 08:22 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, James Bottomley wrote:
>
> >> On Sun, 3 May 2009, James Bottomley wrote:
> >>
> >>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>
> >>> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> >>>> On Sun, 3 May 2009, James Bottomley wrote:
> >>>>
> >>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>>>
> >>>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> >>>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
> >>>>>>
> >>>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >>>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >>>>>>>> <akpm@linux-foundation.org> wrote:
> >>>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> This is a repost of DRBD
> >>>>>>>>>
> >>>>>>>>> Is it being used anywhere for anything? If so, where and what?
> >>>>>>>>
> >>>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
> >>>>>>>> in order to build a highly available iSCSI storage target.
> >>>>>>>
> >>>>>>> Confirmed, I have several customers who're doing exactly that.
> >>>>>>
> >>>>>> I will also say that there are a lot of us out here who would have a use
> >>>>>> for DRDB in our HA setups, but have held off implementing it specificly
> >>>>>> because it's not yet in the upstream kernel.
> >>>>>
> >>>>> Actually, that's not a particularly strong reason because we already
> >>>>> have an in-kernel replicator that has much of the functionality of drbd
> >>>>> that you could use. The main reason for wanting drbd in kernel is that
> >>>>> it has a *current* user base.
> >>>>>
> >>>>> Both the in kernel md/nbd and drbd do sync and async replication with
> >>>>> primary side bitmaps. The main differences are:
> >>>>>
> >>>>> * md/nbd can do 1 to N replication,
> >>>>> * drbd can do active/active replication (useful for cluster
> >>>>> filesystems)
> >>>>> * The chunk size of the md/nbd is tunable
> >>>>> * With the updated nbd-tools, current md/nbd can do point in time
> >>>>> rollback on transaction logged secondaries (a BCS requirement)
> >>>>> * drbd manages the mirror state explicitly, md/nbd needs a user
> >>>>> space helper
> >>>>>
> >>>>> And probably a few others I forget.
> >>>>
> >>>> one very big one:
> >>>>
> >>>> DRDB has better support for dealing with split brain situations and
> >>>> recovering from them.
> >>>
> >>> I don't really think so. The decision about which (or if a) node should
> >>> be killed lies with the HA harness outside of the province of the
> >>> replication.
> >>>
> >>> One could argue that the symmetric active mode of drbd allows both nodes
> >>> to continue rather than having the harness make a kill decision about
> >>> one. However, if they both alter the same data, you get an
> >>> irreconcilable data corruption fault which, one can argue, is directly
> >>> counter to HA principles and so allowing drbd continuation is arguably
> >>> the wrong thing to do.
> >>
> >> but the issue is that at the time the failure is taking place, neither
> >> side _knows_ that the other side is running. In fact, they both think that
> >> the other side is dead.
> >
> > Resolving this is the job of the HA harness, as I said ... the usual
> > solution being either third node pings or confirmable switchover.
>
> and none of those solutions are failsafe in a distributed environment (in
> a local environment you can have a race to see which system powers off the
> other first to ensure that at most one is running, but you can't do that
> reliably remotely)
Um, yes they are, that's why they're used.
Do you understand how they work?
Third node ping means that there has to be an external third node acting
as mediator (like a quorum device) ... usually in a third location. A
node surviving has to make contact with it before failover can proceed
automatically (the running node has to be in contact to keep running).
Confirmable switchover is where the cluster detects the failure and
pages an admin to check on the remote and confirm or deny the switch
over manually. Without the confirmation it just waits.
Both of these mechanisms are robust to split brain. By and large most
enterprises I've seen go for confirmable switchover, but some do
implement third node ping.
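From the standby node's point of view, the third-node-ping rule described
above reduces to a tiny decision function. The sketch below is an
illustrative model only; the function name, the enum, and the exact policy
are invented here, not taken from any real HA harness:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the third-node-ping rule: a node may take over
 * the service only if it has lost the peer *and* can still reach the
 * external mediator.  An isolated node must assume it is the failed one
 * and wait, so split brain cannot occur. */
enum ha_action { HA_TAKE_OVER, HA_WAIT };

static enum ha_action arbitrate(bool peer_reachable, bool mediator_reachable)
{
	if (peer_reachable)
		return HA_WAIT;		/* peer is alive: no failover needed */
	if (mediator_reachable)
		return HA_TAKE_OVER;	/* quorum confirms we may proceed */
	return HA_WAIT;			/* isolated: do not risk split brain */
}
```

Confirmable switchover is the same function with the mediator replaced by a
paged human: without the confirmation, the answer stays HA_WAIT.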
> >> with DRDB, when the two sides start talking again they will discover that
> >> they are different and complain, loudly, to the sysadmin that they need
> >> help
> >
> > The object of HA is to prevent data becoming toast, not to point it out
> > to the sysadmin after the fact.
>
> it needs to do both
>
> >> with md/ndb you have the situation where both sides will try to resync to
> >> the other side as soon as the packets can get through. this can end up
> >> corrupting both sides if it's not caught fast enough
> >
> > Actually, that's just your implementation: md/nbd does nothing to
> > re-establish the replication, it has to be done by the HA harness after
> > split brain resolution. What a correct harness would do is to compare
> > the HA event log and the intent logs to see if there had been activity
> > to both sides after loss of contact and, if their had, to flag the data
> > corruption problem and not resume replication.
> >
> > This corruption situation isn't unique to replication ... any time you
> > may potentially have allowed both sides to write to a data store, you
> > get it, that's why it's the job of the HA harness to sort out whether a
> > split brain happened and what to do about it *first*.
>
> but you can have packets sitting in the network buffers waiting to get to
> the remote machine, then once the connection is reestablished those
> packets will go out. no remounting needed., just connectivity restored.
> (this isn't as bad as if the system tries to re-sync to the temprarily
> unavailable drive by itself, but it can still corrupt things)
This is an interesting thought, but not what happens. As soon as the HA
harness stops replication, which it does at the instant failure is
detected, the closure of the socket kills all the in flight network
data.
There is a variant of this problem that occurs with device mapper
queue_if_no_path (on local disks) which does exactly what you say (keeps
unsaved data around in the queue forever), but that's fixed by not using
queue_if_no_path for HA. Maybe that's what you were thinking of?
> a cluster spread across different locations has problems to face that a
> cluster within easy cabling distance does not.
>
> DRDB has been extensivly tested and build to survive in the harsher
> environment.
There are commercial HA products based on md/nbd, so I'd say it's also
hardened for harsher environments.
> md/ndb is a reasonable approximation for the simple
> enviornment of two servers in one datacenter, but that doesn't mean that
> it handles the rest of the possible conditions.
The implementations I've seen do ... and that includes some fairly
exotic cascading WAN replication ones.
James
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 15:38 ` James Bottomley
@ 2009-05-03 15:48 ` david
2009-05-03 16:02 ` James Bottomley
0 siblings, 1 reply; 44+ messages in thread
From: david @ 2009-05-03 15:48 UTC (permalink / raw)
To: James Bottomley
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 3 May 2009, James Bottomley wrote:
> On Sun, 2009-05-03 at 08:22 -0700, david@lang.hm wrote:
>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>
>>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>>>
>>>>> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
>>>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>>>
>>>>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>>>>>>>
>>>>>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
>>>>>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
>>>>>>>>
>>>>>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
>>>>>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
>>>>>>>>>> <akpm@linux-foundation.org> wrote:
>>>>>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This is a repost of DRBD
>>>>>>>>>>>
>>>>>>>>>>> Is it being used anywhere for anything? If so, where and what?
>>>>>>>>>>
>>>>>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
>>>>>>>>>> in order to build a highly available iSCSI storage target.
>>>>>>>>>
>>>>>>>>> Confirmed, I have several customers who're doing exactly that.
>>>>>>>>
>>>>>>>> I will also say that there are a lot of us out here who would have a use
>>>>>>>> for DRDB in our HA setups, but have held off implementing it specificly
>>>>>>>> because it's not yet in the upstream kernel.
>>>>>>>
>>>>>>> Actually, that's not a particularly strong reason because we already
>>>>>>> have an in-kernel replicator that has much of the functionality of drbd
>>>>>>> that you could use. The main reason for wanting drbd in kernel is that
>>>>>>> it has a *current* user base.
>>>>>>>
>>>>>>> Both the in kernel md/nbd and drbd do sync and async replication with
>>>>>>> primary side bitmaps. The main differences are:
>>>>>>>
>>>>>>> * md/nbd can do 1 to N replication,
>>>>>>> * drbd can do active/active replication (useful for cluster
>>>>>>> filesystems)
>>>>>>> * The chunk size of the md/nbd is tunable
>>>>>>> * With the updated nbd-tools, current md/nbd can do point in time
>>>>>>> rollback on transaction logged secondaries (a BCS requirement)
>>>>>>> * drbd manages the mirror state explicitly, md/nbd needs a user
>>>>>>> space helper
>>>>>>>
>>>>>>> And probably a few others I forget.
>>>>>>
>>>>>> one very big one:
>>>>>>
>>>>>> DRDB has better support for dealing with split brain situations and
>>>>>> recovering from them.
>>>>>
>>>>> I don't really think so. The decision about which (or if a) node should
>>>>> be killed lies with the HA harness outside of the province of the
>>>>> replication.
>>>>>
>>>>> One could argue that the symmetric active mode of drbd allows both nodes
>>>>> to continue rather than having the harness make a kill decision about
>>>>> one. However, if they both alter the same data, you get an
>>>>> irreconcilable data corruption fault which, one can argue, is directly
>>>>> counter to HA principles and so allowing drbd continuation is arguably
>>>>> the wrong thing to do.
>>>>
>>>> but the issue is that at the time the failure is taking place, neither
>>>> side _knows_ that the other side is running. In fact, they both think that
>>>> the other side is dead.
>>>
>>> Resolving this is the job of the HA harness, as I said ... the usual
>>> solution being either third node pings or confirmable switchover.
>>
>> and none of those solutions are failsafe in a distributed environment (in
>> a local environment you can have a race to see which system powers off the
>> other first to ensure that at most one is running, but you can't do that
>> reliably remotely)
>
> Um, yes they are, that's why they're used.
>
> Do you understand how they work?
>
> Third node ping means that there has to be an external third node acting
> as mediator (like a quorum device) ... usually in a third location. A
> node surviving has to make contact with it before failover can proceed
> automatically (the running node has to be in contact to keep running).
this is what I understood; there are many cases where this doesn't work
well
> Confirmable switchover is where the cluster detects the failure and
> pages an admin to check on the remote and confirm or deny the switch
> over manually. Without the confirmation it just waits.
this I did not understand
> Both of these mechanisms are robust to split brain. By and large most
> enterprises I've seen go for confirmable switchover, but some do
> implement third node ping.
it depends on how much tolerance the business has for things to be down as
a result of a problem with the third node (including communications to
it), and how long they are willing to be down while waiting for a sysadmin
to be paged.
>>> This corruption situation isn't unique to replication ... any time you
>>> may potentially have allowed both sides to write to a data store, you
>>> get it, that's why it's the job of the HA harness to sort out whether a
>>> split brain happened and what to do about it *first*.
>>
>> but you can have packets sitting in the network buffers waiting to get to
>> the remote machine, then once the connection is reestablished those
>> packets will go out. no remounting needed., just connectivity restored.
>> (this isn't as bad as if the system tries to re-sync to the temprarily
>> unavailable drive by itself, but it can still corrupt things)
>
> This is an interesting thought, but not what happens. As soon as the HA
> harness stops replication, which it does at the instant failure is
> detected, the closure of the socket kills all the in flight network
> data.
>
> There is an variant of this problem that occurs with device mapper
> queue_if_no_path (on local disks) which does exactly what you say (keeps
> unsaved data around in the queue forever), but that's fixed by not using
> queue_if_no_path for HA. Maybe that's what you were thinking of?
is there a mechanism in nbd that prevents it from being mounted more than
once? if so, then it could have the same protection that DRBD has; if not,
it is possible for it to be mounted in more than one place and therefore
get corrupted.
>> a cluster spread across different locations has problems to face that a
>> cluster within easy cabling distance does not.
>>
>> DRDB has been extensivly tested and build to survive in the harsher
>> environment.
>
> There are commercial HA products based on md/nbd, so I'd say it's also
> hardened for harsher environments
which ones?
David Lang
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 15:48 ` david
@ 2009-05-03 16:02 ` James Bottomley
2009-05-03 16:13 ` david
0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2009-05-03 16:02 UTC (permalink / raw)
To: david
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 2009-05-03 at 08:48 -0700, david@lang.hm wrote:
> On Sun, 3 May 2009, James Bottomley wrote:
>
> > On Sun, 2009-05-03 at 08:22 -0700, david@lang.hm wrote:
> >> On Sun, 3 May 2009, James Bottomley wrote:
> >>
> >>>> On Sun, 3 May 2009, James Bottomley wrote:
> >>>>
> >>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>>>
> >>>>> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> >>>>>> On Sun, 3 May 2009, James Bottomley wrote:
> >>>>>>
> >>>>>>> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> >>>>>>>
> >>>>>>> On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> >>>>>>>> On Sun, 3 May 2009, Willy Tarreau wrote:
> >>>>>>>>
> >>>>>>>>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> >>>>>>>>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> >>>>>>>>>> <akpm@linux-foundation.org> wrote:
> >>>>>>>>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> This is a repost of DRBD
> >>>>>>>>>>>
> >>>>>>>>>>> Is it being used anywhere for anything? If so, where and what?
> >>>>>>>>>>
> >>>>>>>>>> One popular application is to run iSCSI and HA software on top of DRBD
> >>>>>>>>>> in order to build a highly available iSCSI storage target.
> >>>>>>>>>
> >>>>>>>>> Confirmed, I have several customers who're doing exactly that.
> >>>>>>>>
> >>>>>>>> I will also say that there are a lot of us out here who would have a use
> >>>>>>>> for DRDB in our HA setups, but have held off implementing it specificly
> >>>>>>>> because it's not yet in the upstream kernel.
> >>>>>>>
> >>>>>>> Actually, that's not a particularly strong reason because we already
> >>>>>>> have an in-kernel replicator that has much of the functionality of drbd
> >>>>>>> that you could use. The main reason for wanting drbd in kernel is that
> >>>>>>> it has a *current* user base.
> >>>>>>>
> >>>>>>> Both the in kernel md/nbd and drbd do sync and async replication with
> >>>>>>> primary side bitmaps. The main differences are:
> >>>>>>>
> >>>>>>> * md/nbd can do 1 to N replication,
> >>>>>>> * drbd can do active/active replication (useful for cluster
> >>>>>>> filesystems)
> >>>>>>> * The chunk size of the md/nbd is tunable
> >>>>>>> * With the updated nbd-tools, current md/nbd can do point in time
> >>>>>>> rollback on transaction logged secondaries (a BCS requirement)
> >>>>>>> * drbd manages the mirror state explicitly, md/nbd needs a user
> >>>>>>> space helper
> >>>>>>>
> >>>>>>> And probably a few others I forget.
> >>>>>>
> >>>>>> one very big one:
> >>>>>>
> >>>>>> DRDB has better support for dealing with split brain situations and
> >>>>>> recovering from them.
> >>>>>
> >>>>> I don't really think so. The decision about which (or if a) node should
> >>>>> be killed lies with the HA harness outside of the province of the
> >>>>> replication.
> >>>>>
> >>>>> One could argue that the symmetric active mode of drbd allows both nodes
> >>>>> to continue rather than having the harness make a kill decision about
> >>>>> one. However, if they both alter the same data, you get an
> >>>>> irreconcilable data corruption fault which, one can argue, is directly
> >>>>> counter to HA principles and so allowing drbd continuation is arguably
> >>>>> the wrong thing to do.
> >>>>
> >>>> but the issue is that at the time the failure is taking place, neither
> >>>> side _knows_ that the other side is running. In fact, they both think that
> >>>> the other side is dead.
> >>>
> >>> Resolving this is the job of the HA harness, as I said ... the usual
> >>> solution being either third node pings or confirmable switchover.
> >>
> >> and none of those solutions are failsafe in a distributed environment (in
> >> a local environment you can have a race to see which system powers off the
> >> other first to ensure that at most one is running, but you can't do that
> >> reliably remotely)
> >
> > Um, yes they are, that's why they're used.
> >
> > Do you understand how they work?
> >
> > Third node ping means that there has to be an external third node acting
> > as mediator (like a quorum device) ... usually in a third location. A
> > node surviving has to make contact with it before failover can proceed
> > automatically (the running node has to be in contact to keep running).
>
> this is what I understood, there are many cases where this doesn't work
> well
You mean there are situations where both can be down? Sure, but a)
they're rare and b) it's still not a split brain.
> > Confirmable switchover is where the cluster detects the failure and
> > pages an admin to check on the remote and confirm or deny the switch
> > over manually. Without the confirmation it just waits.
>
> this I did not understand
>
> > Both of these mechanisms are robust to split brain. By and large most
> > enterprises I've seen go for confirmable switchover, but some do
> > implement third node ping.
>
> it depends on how much tolerance teh business has for things to be down as
> a result of a problem with the third node (including communications to
> it) and how long they are willing to be down while waiting for a sysadmin
> to be paged
Usually for geo disaster type situations, the recovery plans I've seen
actually *require* manual intervention (likely because they don't fully
trust their HA suppliers, of course ...)
> >>> This corruption situation isn't unique to replication ... any time you
> >>> may potentially have allowed both sides to write to a data store, you
> >>> get it, that's why it's the job of the HA harness to sort out whether a
> >>> split brain happened and what to do about it *first*.
> >>
> >> but you can have packets sitting in the network buffers waiting to get to
> >> the remote machine, then once the connection is reestablished those
> >> packets will go out. no remounting needed., just connectivity restored.
> >> (this isn't as bad as if the system tries to re-sync to the temprarily
> >> unavailable drive by itself, but it can still corrupt things)
> >
> > This is an interesting thought, but not what happens. As soon as the HA
> > harness stops replication, which it does at the instant failure is
> > detected, the closure of the socket kills all the in flight network
> > data.
> >
> > There is an variant of this problem that occurs with device mapper
> > queue_if_no_path (on local disks) which does exactly what you say (keeps
> > unsaved data around in the queue forever), but that's fixed by not using
> > queue_if_no_path for HA. Maybe that's what you were thinking of?
>
> is there a mechanism in ndb that prevents it from beign mounted more than
> once? if so then could have the same protection that DRDB has, if not it
> is possible for it to be mounted more than once place and therefor get
> corrupted.
That's not really relevant, is it? An ordinary disk doesn't have this
property either. Mediating simultaneous access is the job of the HA
harness. If the device does it for you, fine, the harness can make use
of that (as long as the device gets it right) but all good HA harnesses
sort out the usual case where the device doesn't do it.
> >> a cluster spread across different locations has problems to face that a
> >> cluster within easy cabling distance does not.
> >>
> >> DRDB has been extensivly tested and build to survive in the harsher
> >> environment.
> >
> > There are commercial HA products based on md/nbd, so I'd say it's also
> > hardened for harsher environments
>
> which ones?
SteelEye LifeKeeper. It actually supports both drbd and md/nbd.
James
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 16:02 ` James Bottomley
@ 2009-05-03 16:13 ` david
0 siblings, 0 replies; 44+ messages in thread
From: david @ 2009-05-03 16:13 UTC (permalink / raw)
To: James Bottomley
Cc: Willy Tarreau, Bart Van Assche, Andrew Morton, Philipp Reisner,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sun, 3 May 2009, James Bottomley wrote:
> Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
>
> On Sun, 2009-05-03 at 08:48 -0700, david@lang.hm wrote:
>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>> On Sun, 2009-05-03 at 08:22 -0700, david@lang.hm wrote:
>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>
>
>>>>> This corruption situation isn't unique to replication ... any time you
>>>>> may potentially have allowed both sides to write to a data store, you
>>>>> get it, that's why it's the job of the HA harness to sort out whether a
>>>>> split brain happened and what to do about it *first*.
>>>>
>>>> but you can have packets sitting in the network buffers waiting to get to
>>>> the remote machine, then once the connection is reestablished those
>>>> packets will go out. no remounting needed., just connectivity restored.
>>>> (this isn't as bad as if the system tries to re-sync to the temprarily
>>>> unavailable drive by itself, but it can still corrupt things)
>>>
>>> This is an interesting thought, but not what happens. As soon as the HA
>>> harness stops replication, which it does at the instant failure is
>>> detected, the closure of the socket kills all the in flight network
>>> data.
>>>
>>> There is an variant of this problem that occurs with device mapper
>>> queue_if_no_path (on local disks) which does exactly what you say (keeps
>>> unsaved data around in the queue forever), but that's fixed by not using
>>> queue_if_no_path for HA. Maybe that's what you were thinking of?
>>
>> is there a mechanism in ndb that prevents it from beign mounted more than
>> once? if so then could have the same protection that DRDB has, if not it
>> is possible for it to be mounted more than once place and therefor get
>> corrupted.
>
> That's not really relevant, is it? An ordinary disk doesn't have this
> property either. Mediating simultaneous access is the job of the HA
> harness. If the device does it for you, fine, the harness can make use
> of that (as long as the device gets it right) but all good HA harnesses
> sort out the usual case where the device doesn't do it.
with a local disk you can mount it multiple times, write to it from all
the mounts, and not have any problems, because all access goes through a
common layer.
you would have this sort of problem if you used one partition as part of
multiple md arrays, but the md layer itself would detect and prevent this
(because it would see both arrays). in a multi-machine situation,
however, you don't have the common layer to do the detection.
you can rely on the HA layer to detect and prevent all of this (and
apparently there are people doing this, I wasn't aware of it), but I've
seen enough problems with every HA implementation I've dealt with over the
years (both open source and commercial) that I would be very uncomfortable
depending on this exclusively. having the disk replication layer detect
this adds a significant amount of safety in my eyes.
>>> There are commercial HA products based on md/nbd, so I'd say it's also
>>> hardened for harsher environments
>>
>> which ones?
>
> SteelEye LifeKeeper. It actually supports both drbd and md/nbd.
thanks for the info.
David Lang
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 11:00 ` Neil Brown
@ 2009-05-03 21:32 ` Lars Ellenberg
2009-05-04 16:12 ` Lars Marowsky-Bree
2009-05-05 22:08 ` Lars Ellenberg
0 siblings, 2 replies; 44+ messages in thread
From: Lars Ellenberg @ 2009-05-03 21:32 UTC (permalink / raw)
To: Neil Brown
Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
Bart Van Assche
On Sun, May 03, 2009 at 09:00:45PM +1000, Neil Brown wrote:
> On Sunday May 3, lars.ellenberg@linbit.com wrote:
> > > If there some strong technical reason to only allow 2 nodes?
> >
> > It "just" has not yet been implemented.
> > I'm working on that, though.
>
> :-)
>
> >
> > > > How do you fit that into a RAID1+NBD model ? NBD is just a block
> > > > transport, it does not offer the ability to exchange dirty bitmaps or
> > > > data generation identifiers, nor does the RAID1 code has a concept of
> > > > that.
> > >
> > > Not 100% true, but I - at least partly - get your point.
> > > As md stores bitmaps and data generation identifiers on the block
> > > device, these can be transferred over NBD just like any other data on
> > > the block device.
> >
> > Do you have one dirty bitmap per mirror (yet) ?
> > Do you _merge_ them?
>
> md doesn't merge bitmaps yet. However if I found a need to, I would
> simply read a bitmap in userspace and feed it into the kernel via
> /sys/block/mdX/md/bitmap_set_bits
ah, ok. right. that would do it.
> We sort-of have one bitmap per mirror, but only because the one bitmap
> is mirrored...
Which it could not be while the replication link is down.
So once the replication link is back (or the remote node is back,
which is not easily distinguishable at that point, blablabla),
you'd need to fetch the remote bitmap, merge it with the local
bitmap (feeding it into bitmap_set_bits),
then re-attach the "failed" mirror.
The reasoning in commit 9b1d1dac181d8c1b9492e05cee660a985d035a06,
which adds that feature, describes exactly this use case.
There, again, our simple run-length encoding scheme makes very good
sense, as the numbers dropping out of it during decoding are exactly
the run lengths, and could be fed into this almost directly.
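The merge step itself amounts to taking the union of the two peers' dirty
maps: any block that either side modified while disconnected must be
resynced. A minimal userspace sketch, where merge_bitmaps() is a
hypothetical helper and not actual DRBD or md code:

```c
#include <assert.h>
#include <stddef.h>

/* After a split, each node's bitmap marks the blocks that *it* wrote
 * while disconnected; the resync set is the union of both maps, i.e.
 * a word-by-word bitwise OR into the local bitmap. */
static void merge_bitmaps(unsigned long *local, const unsigned long *remote,
			  size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		local[i] |= remote[i];
}
```

This is exactly what feeding the decoded remote run lengths into
bitmap_set_bits achieves, one run at a time instead of one word at a time.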
> > the "NBD" mirrors are remote, and once you lose communication,
> > they may be (and in general, you have to assume they are) modified
> > by which ever node they are directly attached to.
> >
> > > However I think that part of your point is that DRBD can transfer them
> > > more efficiently (e.g. it compresses the bitmap before transferring it
> > > - I assume the compression you use is much more effective than gzip??
> > > else why bother to code your own).
> >
> > No, the point was that we have one bitmap per mirror (though currently
> > number of mirrors == 2, only), and that we do merge them.
>
> Right. I imagine much of the complexity of that could be handled in
> user-space while setting up a DRBD instance (??).
possibly.
you'd need to go through these steps on each and every communication loss
and network handshake. I think that would make the system slower to
react to e.g. "flaky" replication links.
you are thinking in the "MD" paradigm: at any point in time, there is
only one MD instance involved, and the mirror transports (currently dumb
block devices) simply do what they are told.
in DRBD, we have multiple (ok, two) instances talking to each other,
and I think that is the better approach for (remote) replication.
> > but to answer the question:
> > why bother to implement our own encoding?
> > because we know a lot about the data to be encoded.
> >
> > the compression of the bitmap transfer we just added very recently.
> > for a bitmap, with large chunks of bits set or unset, it is efficient
> > to just code the runlength.
> > to use gzip in kernel would add yet an other huge overhead for code
> > tables and so on.
> > during testing of this encoding, applying it to an already gzip'ed file
> > was able to compress it even further, btw.
> > though on english plain text, gzip compression is _much_ more effective.
>
> I just tried a little experiment.
> I created a 128meg file and randomly set 1000 bits in it.
> I compressed it with "gzip --best" and the result was 4Meg. Not
> particularly impressive.
> I then tried to compress it with bzip2 and got 3452 bytes.
> Now *that* is impressive. I suspect your encoding might do a little
> better, but I wonder if it is worth the effort.
The effort is minimal.
The cpu overhead is negligible (compared with bzip2, or any other
generic compression scheme), and the memory overhead is next to none
(just a small scratch buffer, to assemble the network packet).
No tables or anything involved.
Especially the _decoding_ part has this nice property:
  chunk = 0;
  while (!eof) {
          vli_decode_bits(&rl, input); /* number of unset bits */
          chunk += rl;
          vli_decode_bits(&rl, input); /* number of set bits */
          bitmap_dirty_bits(bitmap, chunk, chunk + rl);
          chunk += rl;
  }
The source code is there.
For your example, on average you'd have (128 << 23) / 1000 "clear" bits,
then one set bit. The encoding transfers
"first bit unset -- ca. (1<<20), 1, ca. (1<<20), 1, ca. (1<<20), 1, ...",
using 2 bits for the "1", and up to 29 bits for the "ca. 1<<20".
should be in the very same ballpark as your bzip2 result.
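To make the scheme concrete, here is a toy userland model of the idea -- NOT DRBD's actual VLI code (`rle_encode`/`rle_decode` are illustrative names). Runs are stored as plain uint32_t for clarity, whereas DRBD packs them with a variable-length bit code; the stream always starts with the length of the first "clear" run, which may be zero:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One byte per bit keeps the toy readable. */
static size_t rle_encode(const uint8_t *bitmap, size_t nbits,
                         uint32_t *runs, size_t max_runs)
{
        size_t n = 0, i = 0;
        uint8_t cur = 0;                /* first run counts clear bits */

        while (i < nbits && n < max_runs) {
                uint32_t rl = 0;
                while (i < nbits && bitmap[i] == cur) {
                        rl++;
                        i++;
                }
                runs[n++] = rl;
                cur = !cur;
        }
        return n;
}

/* Mirrors the in-kernel decode loop quoted above: walk the runs,
 * dirtying the bitmap only for the "set" runs. */
static void rle_decode(const uint32_t *runs, size_t nruns,
                       uint8_t *bitmap, size_t nbits)
{
        size_t chunk = 0;
        uint8_t cur = 0;

        memset(bitmap, 0, nbits);
        for (size_t r = 0; r < nruns; r++) {
                if (cur)
                        for (size_t i = chunk;
                             i < chunk + runs[r] && i < nbits; i++)
                                bitmap[i] = 1;
                chunk += runs[r];
                cur = !cur;
        }
}
```

For 1000 isolated set bits in a 1<<30-bit bitmap this yields roughly 2001 runs, i.e. a few kilobytes even with fixed 32-bit run lengths; the variable-length code shrinks that further.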
> I'm not certain that my test file is entirely realistic, but it is
> still an interesting experiment.
It is not ;) but still...
If you are interested, I can dig up my throwaway userland code
that was used to evaluate various such schemes.
But it is so ugly that I won't post it to lkml.
> Why do you do this compression in the kernel? It seems to me that it
> would be quite practical to do it all in user-space, thus making it
> really easy to use pre-existing libraries.
Because the bitmap exchange happens in kernel.
If one were rewriting the replication solution from scratch,
one could reconsider that design choice.
But DRBD as of now does the connection handshake and bitmap exchange in
kernel. We wanted to have a fast compression scheme suitable for
bitmaps, without cpu or memory overhead. This does it quite nicely.
I can dig up my userland throwaway code used during evaluation
of various encoding schemes again, if you are interested.
> BTW, the kernel already contains various compression code as part of
> the crypto API.
Of course I know. But you are not really suggesting that I should do
bzip2 in kernel to exchange the bitmap. And on decoding, I want those
runlengths, not the actual plain bitmap.
> > > You say "nor does the RAID1 code has a concept of that". It isn't
> > > clear what you are referring to.
> >
> > The concept that one of the mirrors (the "nbd" one in that picture)
> > may have been accessed independently, without MD knowing,
> > because the node this MD (and its "local" mirror) was living on
> > suffered from power outage.
or the link has been down,
and the remote side decided to go active with it.
or the link has been taken down,
to activate the other side, knowingly creating a data set divergence,
to do some off-site processing.
> > The concept of both mirrors being modified _simultaneously_,
> > (e.g. living below a cluster file system).
>
> Yes, that is an important concept. Certainly one of the bits that
> would need to be added to md.
>
> > > Whether the current DRBD code gets merged or not is possibly a
> > > separate question, though I would hope that if we followed the path of
> > > merging DRBD into md/raid1, then any duplicate code would eventually be
> > > excised from the kernel.
> >
> > Rumor [http://lwn.net/Articles/326818/] has it that the various
> > in-kernel raid implementations are being unified right now, anyway?
>
> I'm not holding my breath on that one...
> I think that merging DRBD with md/raid1 would be significantly easier
> than any sort of merge between md and dm. But (in either case) I'll
> do what I can to assist any effort that is technically sound.
D'accord.
> > If you want to stick to "replication is almost identical to RAID1",
> > best not to forget "this may be a remote mirror", there may be more than
> > one entity accessing it, this may be part of a bi-directional
> > (active-active) replication setup.
> >
> > For further ideas on what could be done with replication (enhancing the
> > strict "raid1" notion), see also
> > http://www.drbd.org/fileadmin/drbd/publications/drbd9.linux-kongress.2008.pdf
> >
> > - time shift replication
> > - generic point in time recovery of block device data
> > - (remote) backup by periodically, round-robin re-sync of
> > "raid" members, then "dropping" them again.
> > ...
> >
> > No useable code on those ideas, yet,
> > but a lot of thought. It is not all handwaving.
>
> :-)
>
> I'll have to do a bit of reading I see. I'll then try to rough out a
> design and plan for merging DRBD functionality with md/raid1. At the
> very least that would give me enough background understanding to be
> able to sensibly review your code submission.
Thanks. Please give particular attention to the "taxonomy paper"
referenced therein, so that we end up using the same terms.
Lars
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 14:45 ` James Bottomley
2009-05-03 14:56 ` david
@ 2009-05-04 8:28 ` Philipp Reisner
2009-05-04 17:24 ` James Bottomley
1 sibling, 1 reply; 44+ messages in thread
From: Philipp Reisner @ 2009-05-04 8:28 UTC (permalink / raw)
To: James Bottomley
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Sunday 03 May 2009 16:45:25 James Bottomley wrote:
> On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> > On Sun, 3 May 2009, James Bottomley wrote:
> > > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > >
> > > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> > >> On Sun, 3 May 2009, Willy Tarreau wrote:
> > >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> > >>>>
> > >>>> <akpm@linux-foundation.org> wrote:
> > >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner
<philipp.reisner@linbit.com> wrote:
> > >>>>>> This is a repost of DRBD
> > >>>>>
> > >>>>> Is it being used anywhere for anything? If so, where and what?
> > >>>>
> > >>>> One popular application is to run iSCSI and HA software on top of
> > >>>> DRBD in order to build a highly available iSCSI storage target.
> > >>>
> > >>> Confirmed, I have several customers who're doing exactly that.
> > >>
> > >> I will also say that there are a lot of us out here who would have a
> > >> use for DRBD in our HA setups, but have held off implementing it
> > >> specifically because it's not yet in the upstream kernel.
> > >
> > > Actually, that's not a particularly strong reason because we already
> > > have an in-kernel replicator that has much of the functionality of drbd
> > > that you could use. The main reason for wanting drbd in kernel is that
> > > it has a *current* user base.
> > >
> > > Both the in kernel md/nbd and drbd do sync and async replication with
> > > primary side bitmaps. The main differences are:
> > >
> > > * md/nbd can do 1 to N replication,
> > > * drbd can do active/active replication (useful for cluster
> > > filesystems)
> > > * The chunk size of the md/nbd is tunable
> > > * With the updated nbd-tools, current md/nbd can do point in time
> > > rollback on transaction logged secondaries (a BCS requirement)
> > > * drbd manages the mirror state explicitly, md/nbd needs a user
> > > space helper
> > >
> > > And probably a few others I forget.
> >
> > one very big one:
> >
> > DRBD has better support for dealing with split-brain situations and
> > recovering from them.
>
> I don't really think so. The decision about which (or if a) node should
> be killed lies with the HA harness outside of the province of the
> replication.
>
> One could argue that the symmetric active mode of drbd allows both nodes
> to continue rather than having the harness make a kill decision about
> one. However, if they both alter the same data, you get an
> irreconcilable data corruption fault which, one can argue, is directly
> counter to HA principles and so allowing drbd continuation is arguably
> the wrong thing to do.
>
When you do asynchronous replication, how do you ensure that implicit
write-after-write dependencies in the stream of writes you get from
the file system above are not violated on the secondary?
There might be a disk scheduler on the secondary.
-Phil
--
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com
DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 21:32 ` Lars Ellenberg
@ 2009-05-04 16:12 ` Lars Marowsky-Bree
2009-05-05 22:08 ` Lars Ellenberg
1 sibling, 0 replies; 44+ messages in thread
From: Lars Marowsky-Bree @ 2009-05-04 16:12 UTC (permalink / raw)
To: Lars Ellenberg, Neil Brown
Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche
On 2009-05-03T23:32:31, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> Which it could not be while replication link is down,
> so once replication link is back (or remote node is back,
> which is not easily distinguishable just there, blablabla),
> you'd need to fetch the remote bitmap, and merge it with the local
> bitmap (feeding it into bitmap_set_bits),
> then re-attach the "failed" mirror.
Note that this sacrifices transactional consistency on the sync target;
an understandable trade-off (versus recording the stream of writes
entirely, which consumes space and possibly more resync bandwidth), but
a noteworthy one.
> But DRBD as of now does the connection handshake and bitmap exchange in
> kernel. We wanted to have a fast compression scheme suitable for
> bitmaps, without cpu or memory overhead. This does it quite nicely.
Sharing the connection between meta- and regular data also avoids some
ordering issues between channels, which probably helps simplify some
aspects of drbd.
Conceivably, the kernel could escalate such metadata/out-of-band
communications to user-space for handling, and user-space would then
afterwards instruct the continuation of the stream processing.
> or the link has been down,
> and the remote side decided to go active with it.
That is arguably a horrible failure on behalf of the cluster stack being
used, but indeed something drbd must be able to recover from.
Regards,
Lars
--
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-04 8:28 ` Philipp Reisner
@ 2009-05-04 17:24 ` James Bottomley
2009-05-05 8:21 ` Philipp Reisner
0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2009-05-04 17:24 UTC (permalink / raw)
To: Philipp Reisner
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Mon, 2009-05-04 at 10:28 +0200, Philipp Reisner wrote:
> On Sunday 03 May 2009 16:45:25 James Bottomley wrote:
> > On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> > > On Sun, 3 May 2009, James Bottomley wrote:
> > > > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > > >
> > > > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> > > >> On Sun, 3 May 2009, Willy Tarreau wrote:
> > > >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > > >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> > > >>>>
> > > >>>> <akpm@linux-foundation.org> wrote:
> > > >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner
> <philipp.reisner@linbit.com> wrote:
> > > >>>>>> This is a repost of DRBD
> > > >>>>>
> > > >>>>> Is it being used anywhere for anything? If so, where and what?
> > > >>>>
> > > >>>> One popular application is to run iSCSI and HA software on top of
> > > >>>> DRBD in order to build a highly available iSCSI storage target.
> > > >>>
> > > >>> Confirmed, I have several customers who're doing exactly that.
> > > >>
> > > >> I will also say that there are a lot of us out here who would have a
> > > >> use for DRBD in our HA setups, but have held off implementing it
> > > >> specifically because it's not yet in the upstream kernel.
> > > >
> > > > Actually, that's not a particularly strong reason because we already
> > > > have an in-kernel replicator that has much of the functionality of drbd
> > > > that you could use. The main reason for wanting drbd in kernel is that
> > > > it has a *current* user base.
> > > >
> > > > Both the in kernel md/nbd and drbd do sync and async replication with
> > > > primary side bitmaps. The main differences are:
> > > >
> > > > * md/nbd can do 1 to N replication,
> > > > * drbd can do active/active replication (useful for cluster
> > > > filesystems)
> > > > * The chunk size of the md/nbd is tunable
> > > > * With the updated nbd-tools, current md/nbd can do point in time
> > > > rollback on transaction logged secondaries (a BCS requirement)
> > > > * drbd manages the mirror state explicitly, md/nbd needs a user
> > > > space helper
> > > >
> > > > And probably a few others I forget.
> > >
> > > one very big one:
> > >
> > > DRBD has better support for dealing with split-brain situations and
> > > recovering from them.
> >
> > I don't really think so. The decision about which (or if a) node should
> > be killed lies with the HA harness outside of the province of the
> > replication.
> >
> > One could argue that the symmetric active mode of drbd allows both nodes
> > to continue rather than having the harness make a kill decision about
> > one. However, if they both alter the same data, you get an
> > irreconcilable data corruption fault which, one can argue, is directly
> > counter to HA principles and so allowing drbd continuation is arguably
> > the wrong thing to do.
> >
>
> When you do asynchronous replication, how do you ensure that implicit
> write-after-write dependencies in the stream of writes you get from
> the file system above, are not violated on the secondary ?
Are you telling me drbd doesn't currently do this?
The way nbd does it (in the updated tools) is to use DIRECT_IO and
fsync.
> There might be a disk scheduler on the secondary.
There usually is a disk scheduler ... you just have to take the required
action to persuade it to preserve ordering ... a simplistic way of doing
this is to switch to the noop scheduler.
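A minimal userland sketch of that dio/fsync approach (our reading; the real nbd tools differ in detail, and `apply_write` is an illustrative name): each replicated write is applied in arrival order and forced to stable storage before the next one, so the secondary's on-disk state never runs ahead of write order. DIRECT_IO would additionally bypass the page cache; plain fsync() keeps the sketch runnable on any filesystem:

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Apply one replicated write and make it durable before returning,
 * so no later write in the stream can overtake it on disk. */
static int apply_write(int fd, const void *buf, size_t len, off_t off)
{
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                return -1;
        return fsync(fd);
}
```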
James
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-01 11:15 ` Lars Marowsky-Bree
2009-05-01 13:14 ` Dave Jones
@ 2009-05-05 4:05 ` Christian Kujau
1 sibling, 0 replies; 44+ messages in thread
From: Christian Kujau @ 2009-05-05 4:05 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: Andrew Morton, Philipp Reisner, LKML, Jens Axboe, Greg KH,
Neil Brown, James Bottomley, Sam Ravnborg, Dave Jones,
Nikanth Karthikesan, Nicholas A. Bellinger, Kyle Moffett,
Bart Van Assche, Lars Ellenberg
On Fri, 1 May 2009, Lars Marowsky-Bree wrote:
> It is used by many customers (thousands world-wide, I'm sure) to
> replicate block device data locally (to replace more expensive SANs
> while achieving higher availablity) or async/remotely (for disaster
> recovery).
While that page really covers Linux-HA success stories, most of them
use DRBD as well: http://moin.linux-ha.org/lha/SuccessStories
C.
--
Bruce Schneier does not sleep. He preempts everything.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-04 17:24 ` James Bottomley
@ 2009-05-05 8:21 ` Philipp Reisner
2009-05-05 14:09 ` James Bottomley
2009-05-05 15:03 ` Bart Van Assche
0 siblings, 2 replies; 44+ messages in thread
From: Philipp Reisner @ 2009-05-05 8:21 UTC (permalink / raw)
To: James Bottomley
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Monday 04 May 2009 19:24:11 James Bottomley wrote:
> On Mon, 2009-05-04 at 10:28 +0200, Philipp Reisner wrote:
> > On Sunday 03 May 2009 16:45:25 James Bottomley wrote:
> > > On Sun, 2009-05-03 at 07:36 -0700, david@lang.hm wrote:
> > > > On Sun, 3 May 2009, James Bottomley wrote:
> > > > > Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
> > > > >
> > > > > On Sat, 2009-05-02 at 22:40 -0700, david@lang.hm wrote:
> > > > >> On Sun, 3 May 2009, Willy Tarreau wrote:
> > > > >>> On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
> > > > >>>> On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
> > > > >>>>
> > > > >>>> <akpm@linux-foundation.org> wrote:
> > > > >>>>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner
> >
> > <philipp.reisner@linbit.com> wrote:
> > > > >>>>>> This is a repost of DRBD
> > > > >>>>>
> > > > >>>>> Is it being used anywhere for anything? If so, where and what?
> > > > >>>>
> > > > >>>> One popular application is to run iSCSI and HA software on top
> > > > >>>> of DRBD in order to build a highly available iSCSI storage
> > > > >>>> target.
> > > > >>>
> > > > >>> Confirmed, I have several customers who're doing exactly that.
> > > > >>
> > > > >> I will also say that there are a lot of us out here who would have
> > > > >> a use for DRBD in our HA setups, but have held off implementing it
> > > > >> specifically because it's not yet in the upstream kernel.
> > > > >
> > > > > Actually, that's not a particularly strong reason because we
> > > > > already have an in-kernel replicator that has much of the
> > > > > functionality of drbd that you could use. The main reason for
> > > > > wanting drbd in kernel is that it has a *current* user base.
> > > > >
> > > > > Both the in kernel md/nbd and drbd do sync and async replication
> > > > > with primary side bitmaps. The main differences are:
> > > > >
> > > > > * md/nbd can do 1 to N replication,
> > > > > * drbd can do active/active replication (useful for cluster
> > > > > filesystems)
> > > > > * The chunk size of the md/nbd is tunable
> > > > > * With the updated nbd-tools, current md/nbd can do point in
> > > > > time rollback on transaction logged secondaries (a BCS requirement)
> > > > > * drbd manages the mirror state explicitly, md/nbd needs a user
> > > > > space helper
> > > > >
> > > > > And probably a few others I forget.
> > > >
> > > > one very big one:
> > > >
> > > > DRBD has better support for dealing with split-brain situations and
> > > > recovering from them.
> > >
> > > I don't really think so. The decision about which (or if a) node
> > > should be killed lies with the HA harness outside of the province of
> > > the replication.
> > >
> > > One could argue that the symmetric active mode of drbd allows both
> > > nodes to continue rather than having the harness make a kill decision
> > > about one. However, if they both alter the same data, you get an
> > > irreconcilable data corruption fault which, one can argue, is directly
> > > counter to HA principles and so allowing drbd continuation is arguably
> > > the wrong thing to do.
> >
> > When you do asynchronous replication, how do you ensure that implicit
> > write-after-write dependencies in the stream of writes you get from
> > the file system above, are not violated on the secondary ?
>
> Are you telling me drbd doesn't currently do this?
>
No, I am not. DRBD does exactly this!
But I am wondering how that is achieved in the MD/NBD stack when running
in async mode.
DRBD has covered this issue since its early days (back in 2000).
The issue, and the solution we have in DRBD is described in this paper:
http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf
> The way nbd does it (in the updated tools) is to use DIRECT_IO and
> fsync.
Is that available in the existing tools? -- Are the updated tools
something that will be available in the future?
Are you telling me md/nbd (async) doesn't currently do this?
> > There might be a disk scheduler on the secondary.
>
> There usually is a disk scheduler ... you just have to take the required
> action to persuade it to preserve ordering ... a simplistic way of doing
> this is to switch to the noop scheduler.
The issue actually goes further down the stack: not only might the
in-kernel disk scheduler reorder something, the driver and finally the
drive itself might do so as well.
What we have in DRBD boils down to:
* We obey all possible write-after-write dependencies in the stream of
writes we get from the upper layers, and generate DRBD-internal
reorder barriers for the packet stream.
* On the secondary node we impose these barriers onto the stream of writes
submitted to the stack below us by either:
- Letting previously submitted write-IO drain before we submit write-IO
after such a DRBD barrier. (We have had that since 2000 or so.)
- Additionally issue a blkdev_issue_flush()
- Use write requests with BIO_RW_BARRIER. This method has two advantages:
We can continue to submit writes after the DRBD internal barrier
immediately, and the number of requests with BIO_RW_BARRIER can be
further reduced.
See section 6 of
http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
for more details, and nice illustrations.
Unfortunately only high-end SAN devices seem to benefit from this
method. For most in-machine disk controllers this method does not
achieve the highest throughput.
Expressed in other words:
we allow reordering on the secondary node only to an extent that
guarantees that no implicit write-after-write dependencies are violated.
Coming back to the idea of disabling the Linux IO scheduler: it might
solve the issue for some devices, but it is not guaranteed to solve it.
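As a toy model of the barrier scheme above (illustrative names, not DRBD source): the primary closes an "epoch" whenever a new write may depend on the completion of an earlier one, and the secondary may reorder freely within an epoch but must drain epoch N before submitting anything from epoch N + 1:

```c
#include <stddef.h>

struct wreq {
        int sector;             /* target sector */
        int closes_epoch;       /* primary saw a potential dependency */
        int epoch;              /* filled in by assign_epochs() */
};

/* Primary side: tag each write with its epoch; a DRBD barrier packet
 * would be sent into the replication stream at every increment. */
static void assign_epochs(struct wreq *w, size_t n)
{
        int e = 0;

        for (size_t i = 0; i < n; i++) {
                if (w[i].closes_epoch)
                        e++;
                w[i].epoch = e;
        }
}

/* Secondary side: a completion order is admissible iff epoch numbers
 * never decrease -- exactly "drain before crossing a barrier". */
static int order_is_admissible(const struct wreq *w, size_t n)
{
        for (size_t i = 1; i < n; i++)
                if (w[i].epoch < w[i - 1].epoch)
                        return 0;
        return 1;
}
```

Within an epoch the disk, driver, and scheduler may reorder at will; only the epoch boundaries carry the write-after-write guarantee.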
-Phil
--
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com
DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 8:21 ` Philipp Reisner
@ 2009-05-05 14:09 ` James Bottomley
2009-05-05 15:56 ` Philipp Reisner
2009-05-05 15:03 ` Bart Van Assche
1 sibling, 1 reply; 44+ messages in thread
From: James Bottomley @ 2009-05-05 14:09 UTC (permalink / raw)
To: Philipp Reisner
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > When you do asynchronous replication, how do you ensure that implicit
> > > write-after-write dependencies in the stream of writes you get from
> > > the file system above, are not violated on the secondary ?
> >
> > Are you telling me drbd doesn't currently do this?
> >
>
> No I am not. DRBD does exactly this!
> But I am wondering how that is achieved in the MD/NBD stack when running
> in async mode.
The explanation is below.
> The issue is covered since the early days in DRBD, (back in 2000).
> The issue, and the solution we have in DRBD is described in this paper:
>
> http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf
>
> > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > fsync).
>
> Is that available in the existing tools ? -- Are the updated tools
> something that will be available in the future ?
It's in the existing tools.
> Are you telling me md/nbd (async) doesn't currently do this?
I just described how it does this ... I don't quite see how that
translates into telling you it doesn't do this.
> > > There might be a disk scheduler on the secondary.
> >
> > There usually is a disk scheduler ... you just have to take the required
> > action to persuade it to preserve ordering ... a simplistic way of doing
> > this is to switch to the noop scheduler.
>
> The issue actually goes further down the stack. Not only the in kernel
> disk scheduler might reorder something, also the driver and finally the
> drive might do so.
>
> What we have in DRBD boils down to:
>
> * We obey all possible write after write dependencies in the stream of
> writes we get from the upper layers. And generate DRBD internal
> reorder barriers for the packet stream.
> * On the secondary node we impose these barriers onto the stream of writes
> submitted to the stack below us by either:
>
> - Let previously submitted write-IO drain before we submit write-IO after
> such an DRBD barrier. (That we have since 2000 or so)
>
> - Additionally issue a blkdev_issue_flush()
>
> - Use write requests with BIO_RW_BARRIER. This method has two advantages:
> We can continue to submit writes after the DRBD internal barrier
> immediately, and the number of requests with BIO_RW_BARRIER can be
> further reduced.
> See section 6 of
> http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
> for more details, and nice illustrations.
There's a slight error in there ... we don't use ordered tags for
barriers (yet). I don't think it will really matter, because the main
domain of ordering problems is the scheduler, which REQ_BARRIER does
cope with; it just means the queue drains for a barrier.
> Unfortunately only high end SAN devices seem to benefit from this
> method. For most in-machine disk controllers this method does not
> achieve the highest throughput.
>
> Expressed in other words:
> We allow reordering on the secondary node to an extent so that we can
> guarantee that no implicit write-after-write dependencies are violated.
>
> Coming back to the idea of disabling the in Linux IO scheduler. It might
> solve the issue for some devices, but it does not guarantee to solve it.
I think you'll find the dio/fsync method above actually does solve all
of these issues (mainly because it enforces the semantics from top to
bottom in the stack). I agree one could use more elaborate semantics
like you do for drbd, but since the simple ones worked efficiently for
md/nbd, there didn't seem to be much point.
James
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 8:21 ` Philipp Reisner
2009-05-05 14:09 ` James Bottomley
@ 2009-05-05 15:03 ` Bart Van Assche
2009-05-05 15:57 ` Philipp Reisner
1 sibling, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2009-05-05 15:03 UTC (permalink / raw)
To: Philipp Reisner
Cc: James Bottomley, david, Willy Tarreau, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Tue, May 5, 2009 at 10:21 AM, Philipp Reisner
<philipp.reisner@linbit.com> wrote:
> What we have in DRBD boils down to:
>
> * We obey all possible write after write dependencies in the stream of
> writes we get from the upper layers. And generate DRBD internal
> reorder barriers for the packet stream.
Hello Philipp,
I couldn't find a call to blk_queue_ordered() in the DRBD 8.3.1 source
code. This made me wonder how DRBD obtains information about barriers
that are generated by filesystems like ext3 with the option barrier=1?
Bart.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 14:09 ` James Bottomley
@ 2009-05-05 15:56 ` Philipp Reisner
2009-05-05 17:05 ` James Bottomley
0 siblings, 1 reply; 44+ messages in thread
From: Philipp Reisner @ 2009-05-05 15:56 UTC (permalink / raw)
To: James Bottomley
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > When you do asynchronous replication, how do you ensure that implicit
> > > > write-after-write dependencies in the stream of writes you get from
> > > > the file system above, are not violated on the secondary ?
> > >
[...]
> > > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > > fsync).
> >
[...]
> I think you'll find the dio/fsync method above actually does solve all
> of these issues (mainly because it enforces the semantics from top to
> bottom in the stack). I agree one could use more elaborate semantics
> like you do for drbd, but since the simple ones worked efficiently for
> md/nbd, there didn't seem to be much point.
>
Do I get it right that you enforce the exact same write order on the
secondary node as the stream of writes came in on the primary?
Using either DIRECT_IO or fsync() calls?
Is DIRECT_IO/fsync() enabled by default ?
-Phil
--
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com
DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 15:03 ` Bart Van Assche
@ 2009-05-05 15:57 ` Philipp Reisner
2009-05-05 17:38 ` Lars Marowsky-Bree
0 siblings, 1 reply; 44+ messages in thread
From: Philipp Reisner @ 2009-05-05 15:57 UTC (permalink / raw)
To: Bart Van Assche
Cc: James Bottomley, david, Willy Tarreau, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Tuesday 05 May 2009 17:03:13 Bart Van Assche wrote:
> On Tue, May 5, 2009 at 10:21 AM, Philipp Reisner
>
> <philipp.reisner@linbit.com> wrote:
> > What we have in DRBD boils down to:
> >
> > * We obey all possible write after write dependencies in the stream of
> > writes we get from the upper layers. And generate DRBD internal
> > reorder barriers for the packet stream.
>
> Hello Philipp,
>
> I couldn't find a call to blk_queue_ordered() in the DRBD 8.3.1 source
> code. This made me wonder how DRBD obtains information about barriers
> that is generated by filesystems like ext3 with the option barrier=1 ?
>
Hi Bart,
I was referring to implicit write-after-write dependencies that one
needs to obey when doing asynchronous replication.
Up to now we do not offer barrier support for the layers above us.
That will follow sooner or later.
Here is an example of why it is not completely trivial:
Imagine DRBD on top of a dm-linear on both nodes. When you start,
both dm-linear mappings sit on top of something that supports
barriers itself. -- Then the user replaces the backing device
below the dm-linear on the secondary node with something that
does not support barriers.
When we get a write request with the BIO_RW_BARRIER flag set
in from the FS, we submit it locally, ship it over to the
peer and submit it there. Unfortunately it now fails with
ENOTSUP on the peer.
We cannot ship that error back to the upper layer, because
our mirror is already inconsistent. We have to resubmit
it with BIO_RW_BARRIER cleared, and use other means to enforce
write ordering... Then tell the other node that we prefer
to no longer accept BIO_RW_BARRIER, etc.
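A hypothetical sketch of that fallback (names are illustrative, not DRBD's actual state machine): try a barrier write once; if the lower device reports it unsupported, fall back permanently to drain-based ordering and resubmit with the flag cleared, since the error must not be passed up once the mirror is already inconsistent:

```c
#define SKETCH_EOPNOTSUPP (-95)

enum wo_method { WO_BIO_BARRIER, WO_DRAIN };

/* 'submit' models handing the write to the lower device; its argument
 * is the (set or cleared) BIO_RW_BARRIER flag. */
static int submit_with_fallback(int (*submit)(int barrier),
                                enum wo_method *method)
{
        if (*method == WO_BIO_BARRIER) {
                int err = submit(1);

                if (err != SKETCH_EOPNOTSUPP)
                        return err;
                *method = WO_DRAIN;     /* tell the peer: no more barriers */
                /* ...drain previously submitted I/O here... */
        }
        return submit(0);               /* resubmit without the flag */
}

/* Demo lower device without barrier support, like the replaced
 * backing device in the dm-linear example. */
static int no_barrier_dev(int barrier)
{
        return barrier ? SKETCH_EOPNOTSUPP : 0;
}
```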
-Phil
--
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com
DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 15:56 ` Philipp Reisner
@ 2009-05-05 17:05 ` James Bottomley
2009-05-05 21:45 ` Philipp Reisner
0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2009-05-05 17:05 UTC (permalink / raw)
To: Philipp Reisner
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Tue, 2009-05-05 at 17:56 +0200, Philipp Reisner wrote:
> On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> > On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > > When you do asynchronous replication, how do you ensure that implicit
> > > > > write-after-write dependencies in the stream of writes you get from
> > > > > the file system above, are not violated on the secondary ?
> > > >
> [...]
> > > > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > > > fsync).
> > >
> [...]
> > I think you'll find the dio/fsync method above actually does solve all
> > of these issues (mainly because it enforces the semantics from top to
> > bottom in the stack). I agree one could use more elaborate semantics
> > like you do for drbd, but since the simple ones worked efficiently for
> > md/nbd, there didn't seem to be much point.
> >
>
> Do I get it right, that you enforce the exact same write order on the
> secondary node as the stream of writes was comming in on the primary?
Um, yes ... that's the text book way of doing replication: write order
preservation.
> Using either DIRECT_IO or fsync() calls ?
Yes.
> Is DIRECT_IO/fsync() enabled by default ?
I'd have to look at the tools (and, unfortunately, there are many
variants) but it was certainly true in the variant I used. However, the
current main use case of md/nbd is a secondary transaction log to allow
rollback anyway, so the incoming network stream is stored on the device
in write order and the problem doesn't arise.
I also think you're not quite looking at the important case: if you
think about it, the real necessity for the ordered domain is the
network, not so much the actual secondary server. The reason is that
it's very hard to find a failure case where the write order on the
secondary from the network tap to disk actually matters (as long as the
flight into the network tap was in order). The standard failure is of
the primary, not the secondary, so the network stream stops and so does
the secondary writing: as long as we guarantee to stop at a consistent
point in flight, everything works. If the secondary fails while the
primary is still up, that's just a standard replay to bring the
secondary back into replication, so the issue doesn't arise there
either.
The case where it does matter is failure of the primary followed by
instantaneous failure of the secondary before the actual network stream
completes, so guaranteeing that the secondary can be brought back up
consistently. However, this is an incredibly rare failure scenario
given the tight race timings.
James
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 15:57 ` Philipp Reisner
@ 2009-05-05 17:38 ` Lars Marowsky-Bree
0 siblings, 0 replies; 44+ messages in thread
From: Lars Marowsky-Bree @ 2009-05-05 17:38 UTC (permalink / raw)
To: Philipp Reisner, Bart Van Assche
Cc: James Bottomley, david, Willy Tarreau, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Kyle Moffett, Lars Ellenberg
On 2009-05-05T17:57:15, Philipp Reisner <philipp.reisner@linbit.com> wrote:
> Up to now we do not offer barrier support for the layers above us.
> That will follow sooner or later.
>
> Here is an example, why it is not completely trivial:
>
> Imagine DRBD on top of a dm-linear on both nodes. When you start,
> both dm-linear mappings sit on top of something that supports
> barriers itself. -- Then the user replaces the backing device
> below the dm-linear on the secondary node with something that
> does not support barriers.
The same problem exists essentially for md raid1 as well, and I'd not
consider it objectionable if you took a brutal approach:
> When we get a write request with the BIO_RW_BARRIER flag set
> in from the FS, we submit this locally, ship it over to the
> peer and submit it there. Unfortunately it fails now with
> ENOTSUP on the peer.
>
> We can not ship that error back to the upper layer, because
> our mirror is already inconsistent.
Disconnect the secondary with a loud error as to why (incompatible
change of the device below). (Re-)negotiate barrier capability at
connect time; then, resync.
Regards,
Lars
--
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 17:05 ` James Bottomley
@ 2009-05-05 21:45 ` Philipp Reisner
2009-05-05 21:53 ` James Bottomley
0 siblings, 1 reply; 44+ messages in thread
From: Philipp Reisner @ 2009-05-05 21:45 UTC (permalink / raw)
To: James Bottomley
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
Am Dienstag 05 Mai 2009 19:05:46 schrieb James Bottomley:
> On Tue, 2009-05-05 at 17:56 +0200, Philipp Reisner wrote:
> > On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> > > On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > > > When you do asynchronous replication, how do you ensure that
> > > > > > implicit write-after-write dependencies in the stream of writes
> > > > > > you get from the file system above, are not violated on the
> > > > > > secondary ?
> >
> > [...]
> >
> > > > > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > > > > fsync).
> >
> > [...]
> >
> > > I think you'll find the dio/fsync method above actually does solve all
> > > of these issues (mainly because it enforces the semantics from top to
> > > bottom in the stack). I agree one could use more elaborate semantics
> > > like you do for drbd, but since the simple ones worked efficiently for
> > > md/nbd, there didn't seem to be much point.
> >
> > Do I get it right, that you enforce the exact same write order on the
> > secondary node as the stream of writes was comming in on the primary?
>
> Um, yes ... that's the text book way of doing replication: write order
> preservation.
>
> > Using either DIRECT_IO or fsync() calls ?
>
> Yes.
>
> > Is DIRECT_IO/fsync() enabled by default ?
>
> I'd have to look at the tools (and, unfortunately, there are many
> variants) but it was certainly true in the variant I used.
[...]
My experience is that enforcing the exact same write order as on the primary
by using IO draining, kills performance. - Of course things are changing in
a world where everybody uses a RAID controller with a gig of battery
backed RAM. But there are for sure some embedded users that run
the replication technology on top of plain hard disks.
What I want to work out is, that in DRBD we have that capability to allow
limited reordering on the secondary, to achieve the highest possible
performance, while maintaining these implicit write-after-write dependencies.
> I also think you're not quite looking at the important case: if you
> think about it, the real necessity for the ordered domain is the
> network, not so much the actual secondary server. The reason is that
> it's very hard to find a failure case where the write order on the
> secondary from the network tap to disk actually matters (as long as the
> flight into the network tap was in order). The standard failure is of
> the primary, not the secondary, so the network stream stops and so does
> the secondary writing: as long as we guarantee to stop at a consistent
> point in flight, everything works. If the secondary fails while the
> primary is still up, that's just a standard replay to bring the
> secondary back into replication, so the issue doesn't arise there
> either.
A common power failure is possible. We aim for an HA system, we can
not ignore a possible failure scenario. No user will buy: Well in most
scenarios we do it correctly, in the unlikely case of a common power
failure, and you loose your former primary at the same time, you might
have a secondary with the last write but not that one write before!
Correctness before efficiency!
But I will now stop this discussion now. Proving that DRBD does some
details better than the md/nbd approch gets pointless, when we agreed
that DRBD can get merged as a driver. We will focus on the necessary
code cleanups.
-Phil
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 21:45 ` Philipp Reisner
@ 2009-05-05 21:53 ` James Bottomley
2009-05-06 8:17 ` Philipp Reisner
0 siblings, 1 reply; 44+ messages in thread
From: James Bottomley @ 2009-05-05 21:53 UTC (permalink / raw)
To: Philipp Reisner
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
On Tue, 2009-05-05 at 23:45 +0200, Philipp Reisner wrote:
> > I also think you're not quite looking at the important case: if you
> > think about it, the real necessity for the ordered domain is the
> > network, not so much the actual secondary server. The reason is that
> > it's very hard to find a failure case where the write order on the
> > secondary from the network tap to disk actually matters (as long as the
> > flight into the network tap was in order). The standard failure is of
> > the primary, not the secondary, so the network stream stops and so does
> > the secondary writing: as long as we guarantee to stop at a consistent
> > point in flight, everything works. If the secondary fails while the
> > primary is still up, that's just a standard replay to bring the
> > secondary back into replication, so the issue doesn't arise there
> > either.
>
> A common power failure is possible. We aim for an HA system, we can
> not ignore a possible failure scenario. No user will buy: Well in most
> scenarios we do it correctly, in the unlikely case of a common power
> failure, and you loose your former primary at the same time, you might
> have a secondary with the last write but not that one write before!
>
> Correctness before efficiency!
Well, you have to agree that during a resync from the activity log,
which plays up the primary disk from one end to another, the secondary
is completely corrupt if a primary failure occurs before the resync
completes. That's something that's triggered by a network outage, and
so is a far more common event than cascading dual failures. It's all
really a question of where you focus your effort to eliminate the corner
cases.
> But I will now stop this discussion now. Proving that DRBD does some
> details better than the md/nbd approch gets pointless, when we agreed
> that DRBD can get merged as a driver. We will focus on the necessary
> code cleanups.
I agree. Also HA is full of corner cases like this and opinion is
endlessly divided over which corner cases are more important than which
others.
James
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-03 21:32 ` Lars Ellenberg
2009-05-04 16:12 ` Lars Marowsky-Bree
@ 2009-05-05 22:08 ` Lars Ellenberg
1 sibling, 0 replies; 44+ messages in thread
From: Lars Ellenberg @ 2009-05-05 22:08 UTC (permalink / raw)
To: Neil Brown
Cc: Philipp Reisner, linux-kernel, Jens Axboe, Greg KH,
James Bottomley, Sam Ravnborg, Dave Jones, Nikanth Karthikesan,
Lars Marowsky-Bree, Nicholas A. Bellinger, Kyle Moffett,
Bart Van Assche
[-- Attachment #1: Type: text/plain, Size: 2654 bytes --]
> > > but to answer the question:
> > > why bother to implement our own encoding?
> > > because we know a lot about the data to be encoded.
> > >
> > > the compression of the bitmap transfer we just added very recently.
> > > for a bitmap, with large chunks of bits set or unset, it is efficient
> > > to just code the runlength.
> > > to use gzip in kernel would add yet an other huge overhead for code
> > > tables and so on.
> > > during testing of this encoding, applying it to an already gzip'ed file
> > > was able to compress it even further, btw.
> > > though on english plain text, gzip compression is _much_ more effective.
> >
> > I just tried a little experiment.
> > I created a 128meg file and randomly set 1000 bits in it.
> > I compressed it with "gzip --best" and the result was 4Meg. Not
> > particularly impressive.
> > I then tried to compress it wit bzip2 and got 3452 bytes.
> > Now *that* is impressive. I suspect your encoding might do a little
> > better, but I wonder if it is worth the effort.
>
> The effort is minimal.
> The cpu overhead is negligible (compared with bzip2, or any other
> generic compression scheme), and the memory overhead is next to none
> (just a small scratch buffer, to assemble the network packet).
> No tables or anything involved.
> Especially the _decoding_ part has this nice property:
> chunk = 0;
> while (!eof) {
> vli_decode_bits(&rl, input); /* number of unset bits */
> chunk += rl;
> vli_decode_bits(&rl, input); /* number of set bits */
> bitmap_dirty_bits(bitmap, chunk, chunk + rl);
> chunk += rl;
> }
>
> The source code is there.
>
> For your example, on average you'd have (128 << 23) / 1000 "clear" bits,
> then one set bit. The encoding transfers
> "first bit unset -- ca. (1<<20), 1, ca. (1<<20), 1, ca. (1<<20), 1, ...",
> using 2 bits for the "1", and up to 29 bit for the "ca. 1<<20".
> should be in the very same ballpark as your bzip2 result.
>
> > I'm not certain that my test file is entirely realistic, but it is
> > still an interesting experiment.
>
> It is not ;) but still...
> If you are interessted, I can dig up my throw away user land code,
> that has been used to evaluate various such schemes.
> But it is so ugly that I won't post it to lkml.
I found round about ten different versions of that throw away code.
Oh well. So I just hacked up an other one.
For your entertainment, prepare some example bitmaps. From all my
real-world example bitmaps I can see that (at least with 4KiB bitmap
granularity), areas with alternating single-bit (which is the only run
length that does not compress) are rare.
Comments wellcome. Have fun.
Lars
[-- Attachment #2: vli_bitstream_demo.c --]
[-- Type: text/x-csrc, Size: 28391 bytes --]
/* vim: set foldmethod=marker foldlevel=1 foldenable :
*
* Copyright 2009 Lars Ellenberg <lars@linbit.com>
* Licence: GPL
*
* Purpose: demonstrate the simple, but efficient (for bitmaps, anyways),
* encoding (usually: compression) used to exchange the DRBD bitmap.
* Note: DRBD transmits incompressible chunks as plain text.
* This demo does not, to better show the properties of the encoding method.
* See also the comments just above the "struct code_chunk"
*
* More than half of this file is (almost) verbatim copy
* from other .c and .h files, so you won't need the extra files,
* like the generic find_next_bit.c or the drbd_vli.h.
*
* Tested on i686 and x86_64 Debian.
* Might have issues on other archs, though I think it should not.
*
* For USAGE,
* see show_usage_and_die() below. */
#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <linux/types.h>
#include <errno.h>
#include <endian.h>
#include <byteswap.h>
/* gcc -g -O3 -Wall -o vli_bitstream_demo vli_bitstream_demo.c */
/* for easy debugging with gdb, do
* gdb --args ./x 3< IN 4> OUT
* then set in_fd and out_fd to 3 and 4,
* and "run".
*/
static int in_fd = 0; /* 0: stdin */
static int out_fd = 1; /* 1: stdout */
/* (almost) verbatim copied files from elsewhere {{{2 */
/* find bit helpers from linux kernel tree {{{3 */
/* taken from arch/x86/include/asm/bitops.h {{{4 */
static inline unsigned long __ffs(unsigned long word)
{
asm("bsf %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static inline unsigned long ffz(unsigned long word)
{
asm("bsf %1,%0"
: "=r" (word)
: "r" (~word));
return word;
}
/* taken from lib/find_next_bit.c {{{4 */
/* find_next_bit.c: fallback find next bit implementation
*
* Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
* Written by David Howells (dhowells@redhat.com)
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
#define BITS_PER_LONG (sizeof(long)*8)
#define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)
/*
* Find the next set bit in a memory region.
*/
unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
const unsigned long *p = addr + BITOP_WORD(offset);
unsigned long result = offset & ~(BITS_PER_LONG-1);
unsigned long tmp;
if (offset >= size)
return size;
size -= result;
offset %= BITS_PER_LONG;
if (offset) {
tmp = *(p++);
tmp &= (~0UL << offset);
if (size < BITS_PER_LONG)
goto found_first;
if (tmp)
goto found_middle;
size -= BITS_PER_LONG;
result += BITS_PER_LONG;
}
while (size & ~(BITS_PER_LONG-1)) {
if ((tmp = *(p++)))
goto found_middle;
result += BITS_PER_LONG;
size -= BITS_PER_LONG;
}
if (!size)
return result;
tmp = *p;
found_first:
tmp &= (~0UL >> (BITS_PER_LONG - size));
if (tmp == 0UL) /* Are any bits set? */
return result + size; /* Nope. */
found_middle:
return result + __ffs(tmp);
}
/*
* This implementation of find_{first,next}_zero_bit was stolen from
* Linus' asm-alpha/bitops.h.
*/
unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
const unsigned long *p = addr + BITOP_WORD(offset);
unsigned long result = offset & ~(BITS_PER_LONG-1);
unsigned long tmp;
if (offset >= size)
return size;
size -= result;
offset %= BITS_PER_LONG;
if (offset) {
tmp = *(p++);
tmp |= ~0UL >> (BITS_PER_LONG - offset);
if (size < BITS_PER_LONG)
goto found_first;
if (~tmp)
goto found_middle;
size -= BITS_PER_LONG;
result += BITS_PER_LONG;
}
while (size & ~(BITS_PER_LONG-1)) {
if (~(tmp = *(p++)))
goto found_middle;
result += BITS_PER_LONG;
size -= BITS_PER_LONG;
}
if (!size)
return result;
tmp = *p;
found_first:
tmp |= ~0UL << size;
if (tmp == ~0UL) /* Are any bits zero? */
return result + size; /* Nope. */
found_middle:
return result + ffz(tmp);
}
/*
* Find the first set bit in a memory region.
*/
unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
{
const unsigned long *p = addr;
unsigned long result = 0;
unsigned long tmp;
while (size & ~(BITS_PER_LONG-1)) {
if ((tmp = *(p++)))
goto found;
result += BITS_PER_LONG;
size -= BITS_PER_LONG;
}
if (!size)
return result;
tmp = (*p) & (~0UL >> (BITS_PER_LONG - size));
if (tmp == 0UL) /* Are any bits set? */
return result + size; /* Nope. */
found:
return result + __ffs(tmp);
}
/*
* Find the first cleared bit in a memory region.
*/
unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
{
const unsigned long *p = addr;
unsigned long result = 0;
unsigned long tmp;
while (size & ~(BITS_PER_LONG-1)) {
if (~(tmp = *(p++)))
goto found;
result += BITS_PER_LONG;
size -= BITS_PER_LONG;
}
if (!size)
return result;
tmp = (*p) | (~0UL << size);
if (tmp == ~0UL) /* Are any bits zero? */
return result + size; /* Nope. */
found:
return result + ffz(tmp);
}
/* end of find_next_bit.c }}}1 */
/* the VLI code implementation from drbd_vli.h {{{3 */
#define u64 __u64
#define u8 __u8
#define BUG() abort()
#if __BYTE_ORDER == __LITTLE_ENDIAN
#define le64_to_cpu(x) (x)
#elif __BYTE_ORDER == __BIG_ENDIAN
#define le64_to_cpu(x) bswap_64(x)
#else
#error "endian?"
#endif
/*
* At a granularity of 4KiB storage represented per bit,
* and stroage sizes of several TiB,
* and possibly small-bandwidth replication,
* the bitmap transfer time can take much too long,
* if transmitted in plain text.
*
* We try to reduce the transfered bitmap information
* by encoding runlengths of bit polarity.
*
* We never actually need to encode a "zero" (runlengths are positive).
* But then we have to store the value of the first bit.
* The first bit of information thus shall encode if the first runlength
* gives the number of set or unset bits.
*
* We assume that large areas are either completely set or unset,
* which gives good compression with any runlength method,
* even when encoding the runlength as fixed size 32bit/64bit integers.
*
* Still, there may be areas where the polarity flips every few bits,
* and encoding the runlength sequence of those areas with fix size
* integers would be much worse than plaintext.
*
* We want to encode small runlength values with minimum code length,
* while still being able to encode a Huge run of all zeros efficiently.
*
* Thus we need a Variable Length Integer encoding, VLI.
*
* For some cases, we produce more code bits than plaintext input.
* We need to send incompressible chunks as plaintext, skip over them
* and then see if the next chunk compresses better.
*
* We don't care too much about "excellent" compression ratio for large
* runlengths (all set/all clear): whether we achieve a factor of 100
* or 1000 is not that much of an issue.
* We do not want to waste too much on short runlengths in the "noisy"
* parts of the bitmap, though.
*
* There are endless variants of VLI, we experimented with:
* * simple byte-based
* * various bit based with different code word length.
*
* To avoid yet an other configuration parameter (choice of bitmap compression
* algorithm) which was difficult to explain and tune, we just chose the one
* variant that turned out best in all test cases.
* Based on real world usage patterns, with device sizes ranging from a few GiB
* to several TiB, file server/mailserver/webserver/mysql/postgress,
* mostly idle to really busy, the all time winner (though sometimes only
* marginally better) is:
*/
/*
* encoding is "visualised" as
* __little endian__ bitstream, least significant bit first (left most)
*
* this particular encoding is chosen so that the prefix code
* starts as unary encoding the level, then modified so that
* 10 levels can be described in 8bit, with minimal overhead
* for the smaller levels.
*
* Number of data bits follow fibonacci sequence, with the exception of the
* last level (+1 data bit, so it makes 64bit total). The only worse code when
* encoding bit polarity runlength is 1 plain bits => 2 code bits.
prefix data bits max val Nº data bits
0 x 0x2 1
10 x 0x4 1
110 xx 0x8 2
1110 xxx 0x10 3
11110 xxx xx 0x30 5
111110 xx xxxxxx 0x130 8
11111100 xxxxxxxx xxxxx 0x2130 13
11111110 xxxxxxxx xxxxxxxx xxxxx 0x202130 21
11111101 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xx 0x400202130 34
11111111 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx 56
* maximum encodable value: 0x100000400202130 == 2**56 + some */
/* compression "table":
transmitted x 0.29
as plaintext x ........................
x ........................
x ........................
x 0.59 0.21........................
x ........................................................
x .. c ...................................................
x 0.44.. o ...................................................
x .......... d ...................................................
x .......... e ...................................................
X............. ...................................................
x.............. b ...................................................
2.0x............... i ...................................................
#X................ t ...................................................
#................. s ........................... plain bits ..........
-+-----------------------------------------------------------------------
1 16 32 64
*/
/* LEVEL: (total bits, prefix bits, prefix value),
* sorted ascending by number of total bits.
* The rest of the code table is calculated at compiletime from this. */
/* fibonacci data 1, 1, ... */
#define VLI_L_1_1() do { \
LEVEL( 2, 1, 0x00); \
LEVEL( 3, 2, 0x01); \
LEVEL( 5, 3, 0x03); \
LEVEL( 7, 4, 0x07); \
LEVEL(10, 5, 0x0f); \
LEVEL(14, 6, 0x1f); \
LEVEL(21, 8, 0x3f); \
LEVEL(29, 8, 0x7f); \
LEVEL(42, 8, 0xbf); \
LEVEL(64, 8, 0xff); \
} while (0)
/* finds a suitable level to decode the least significant part of in.
* returns number of bits consumed.
*
* BUG() for bad input, as that would mean a buggy code table. */
static inline int vli_decode_bits(u64 *out, const u64 in)
{
u64 adj = 1;
#define LEVEL(t,b,v) \
do { \
if ((in & ((1 << b) -1)) == v) { \
*out = ((in & ((~0ULL) >> (64-t))) >> b) + adj; \
return t; \
} \
adj += 1ULL << (t - b); \
} while (0)
VLI_L_1_1();
/* NOT REACHED, if VLI_LEVELS code table is defined properly */
BUG();
#undef LEVEL
}
/* return number of code bits needed,
* or negative error number */
static inline int __vli_encode_bits(u64 *out, const u64 in)
{
u64 max = 0;
u64 adj = 1;
if (in == 0)
return -EINVAL;
#define LEVEL(t,b,v) do { \
max += 1ULL << (t - b); \
if (in <= max) { \
if (out) \
*out = ((in - adj) << b) | v; \
return t; \
} \
adj = max + 1; \
} while (0)
VLI_L_1_1();
return -EOVERFLOW;
#undef LEVEL
}
#undef VLI_L_1_1
/* code from here down is independend of actually used bit code */
/*
* Code length is determined by some unique (e.g. unary) prefix.
* This encodes arbitrary bit length, not whole bytes: we have a bit-stream,
* not a byte stream.
*/
/* for the bitstream, we need a cursor */
struct bitstream_cursor {
/* the current byte */
u8 *b;
/* the current bit within *b, nomalized: 0..7 */
unsigned int bit;
};
/* initialize cursor to point to first bit of stream */
static inline void bitstream_cursor_reset(struct bitstream_cursor *cur, void *s)
{
cur->b = s;
cur->bit = 0;
}
/* advance cursor by that many bits; maximum expected input value: 64,
* but depending on VLI implementation, it may be more. */
static inline void bitstream_cursor_advance(struct bitstream_cursor *cur, unsigned int bits)
{
bits += cur->bit;
cur->b = cur->b + (bits >> 3);
cur->bit = bits & 7;
}
/* the bitstream itself knows its length */
struct bitstream {
struct bitstream_cursor cur;
unsigned char *buf;
size_t buf_len; /* in bytes */
/* for input stream:
* number of trailing 0 bits for padding
* total number of valid bits in stream: buf_len * 8 - pad_bits */
unsigned int pad_bits;
};
static inline void bitstream_init(struct bitstream *bs, void *s, size_t len, unsigned int pad_bits)
{
bs->buf = s;
bs->buf_len = len;
bs->pad_bits = pad_bits;
bitstream_cursor_reset(&bs->cur, bs->buf);
}
static inline void bitstream_rewind(struct bitstream *bs)
{
bitstream_cursor_reset(&bs->cur, bs->buf);
memset(bs->buf, 0, bs->buf_len);
}
/* Put (at most 64) least significant bits of val into bitstream, and advance cursor.
* Ignores "pad_bits".
* Returns zero if bits == 0 (nothing to do).
* Returns number of bits used if successful.
*
* If there is not enough room left in bitstream,
* leaves bitstream unchanged and returns -ENOBUFS.
*/
static inline int bitstream_put_bits(struct bitstream *bs, u64 val, const unsigned int bits)
{
unsigned char *b = bs->cur.b;
unsigned int tmp;
if (bits == 0)
return 0;
if ((bs->cur.b + ((bs->cur.bit + bits -1) >> 3)) - bs->buf >= bs->buf_len)
return -ENOBUFS;
/* paranoia: strip off hi bits; they should not be set anyways. */
if (bits < 64)
val &= ~0ULL >> (64 - bits);
*b++ |= (val & 0xff) << bs->cur.bit;
for (tmp = 8 - bs->cur.bit; tmp < bits; tmp += 8)
*b++ |= (val >> tmp) & 0xff;
bitstream_cursor_advance(&bs->cur, bits);
return bits;
}
/* Fetch (at most 64) bits from bitstream into *out, and advance cursor.
*
* If more than 64 bits are requested, returns -EINVAL and leave *out unchanged.
*
* If there are less than the requested number of valid bits left in the
* bitstream, still fetches all available bits.
*
* Returns number of actually fetched bits.
*/
static inline int bitstream_get_bits(struct bitstream *bs, u64 *out, int bits)
{
u64 val;
unsigned int n;
if (bits > 64)
return -EINVAL;
if (bs->cur.b + ((bs->cur.bit + bs->pad_bits + bits -1) >> 3) - bs->buf >= bs->buf_len)
bits = ((bs->buf_len - (bs->cur.b - bs->buf)) << 3)
- bs->cur.bit - bs->pad_bits;
if (bits == 0) {
*out = 0;
return 0;
}
/* get the high bits */
val = 0;
n = (bs->cur.bit + bits + 7) >> 3;
/* n may be at most 9, if cur.bit + bits > 64 */
/* which means this copies at most 8 byte */
if (n) {
memcpy(&val, bs->cur.b+1, n - 1);
val = le64_to_cpu(val) << (8 - bs->cur.bit);
}
/* we still need the low bits */
val |= bs->cur.b[0] >> bs->cur.bit;
/* and mask out bits we don't want */
val &= ~0ULL >> (64 - bits);
bitstream_cursor_advance(&bs->cur, bits);
*out = val;
return bits;
}
/* encodes @in as vli into @bs;
* return values
* > 0: number of bits successfully stored in bitstream
* -ENOBUFS @bs is full
* -EINVAL input zero (invalid)
* -EOVERFLOW input too large for this vli code (invalid)
*/
static inline int vli_encode_bits(struct bitstream *bs, u64 in)
{
u64 code = code;
int bits = __vli_encode_bits(&code, in);
if (bits <= 0)
return bits;
return bitstream_put_bits(bs, code, bits);
}
/* end of drbd_vli.h }}}1 */
static char *progname;
void show_usage_and_die(void)
{
fprintf(stderr,
"Usage: %s subcommand subcommand-options ...\n"
" Subcommands:\n"
" plain-to-rl <in_file>:\n"
" mmap()s in_file,\n"
" Writes a stream of 32bit native unsigned int to stdout,\n"
" representing the runlengths of set and unset bits.\n"
" First runlength is _unset_ bits,\n"
" and is the only runlength that may be zero.\n\n"
" rl-to-vli\n"
" Takes the output of 'plain-to-rl' from stdin,\n"
" and encodes those runlengths into variable length integer (VLI)\n"
" bitstream chunks (to stdout).\n"
" export RL_TO_VLI_STATS=any-value gives extra stats on this one.\n\n"
" vli-to-rl\n"
" Takes the output of 'rl-to-vli' from stdin,\n"
" and decodes those VLI chunks back into a stream of\n"
" 32bit native unsigned int (to stdout).\n\n"
" rl-to-plain\n"
" Takes a stream of 32bit native unsigned int runlengths from stdin\n"
" and writes the corresponding plaintext to stdout.\n\n"
"Note that in the proper implementation, incompressible chunks\n"
"are stored as plain. This is not done here, to better demonstrate\n"
"the properties of this method\n\n"
"shell examples:\n"
" do_try() { ( set -vxeC\n"
" local f=${1:? missing input file};\n"
" export RL_TO_VLI_STATS=whatever\n"
" time %s plain-to-rl \"$f\" > RL;\n"
" time %s rl-to-vli < RL > BS;\n"
" time %s vli-to-rl < BS > DE;\n"
" time %s rl-to-plain < DE > R;\n"
" ls -l \"$f\" R BS RL DE;\n"
" time cmp RL DE;\n"
" time cmp \"$f\" R )\n"
" }\n"
" # rm -f RL BS DE R # for cleanup, because of set -C\n"
" do_try example_bitmap\n"
"The bit stream in BS is the compressed representation of infile\n\n"
"Stupid comparison for the fun of it. Prepare some example_bitmap, then:\n"
"f=example_bitmap\n"
"time { gzip --best < $f | tee C | gunzip > /dev/null; ls -l C; }\n"
"time { bzip2 < $f | tee C | bunzip2 > /dev/null; ls -l C; }\n"
"time { %s plain-to-rl $f | %s rl-to-vli |\n"
" tee C | %s vli-to-rl |\n"
" %s rl-to-plain > /dev/null; ls -l C; }\n",
progname,
progname, progname, progname, progname,
progname, progname, progname, progname);
exit(1);
}
static char *subcmd;
#define eprintf(fmt, args...) fprintf(stderr, "%s: " fmt, subcmd , ## args)
#define RING_SIZE 1024
static unsigned RL[RING_SIZE];
/* plain-to-rl {{{2 */
/* also used in vli-to-rl */
void write_RL(int i)
{
ssize_t s = i * sizeof(RL[0]);
ssize_t c = write(out_fd, RL, s);
if (c < 0) {
eprintf("error writing runlength blob to stdout: %m\n");
exit(6);
}
if (c != s) {
eprintf("short write: %lu != %lu\n",
(unsigned long)c, (unsigned long)s);
exit(7);
}
}
/* don't risk bit number wrap around.
* could be coded around, of course. */
#define MAX_BYTES ((off_t)((unsigned int)(~0) >> 3))
int plain_to_rl(int argc, char **argv)
{
struct stat sb;
unsigned long current_bit = 0;
unsigned long n_bits;
unsigned long tmp;
unsigned long *bm;
char *in_file;
int toggle = 0;
int i = 0;
int fd;
if (argc != 1) {
eprintf("missing input file argument\n");
return 1;
}
in_file = argv[0];
fd = open(in_file, O_RDONLY);
if (fd < 0) {
eprintf("open('%s'): %m\n", in_file);
return 2;
}
if (fstat(fd, &sb)) {
eprintf("fstat(%s): %m\n", in_file);
return 3;
}
if ((sb.st_mode & S_IFMT) != S_IFREG) {
eprintf("%s: not a regular file\n", in_file);
return 4;
}
if (sb.st_size > MAX_BYTES) {
eprintf("%s too big, only scanning first %lu bytes\n",
in_file, (unsigned long)MAX_BYTES);
sb.st_size = MAX_BYTES;
}
/* maybe TODO: allow start offset and size to be specified,
* possibly use mmap2 */
bm = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
if (bm == MAP_FAILED) {
eprintf("mmap(%s): %m\n", in_file);
return 5;
}
n_bits = sb.st_size << 3;
for (;;) {
toggle = !toggle;
tmp = toggle ? find_next_bit(bm, n_bits, current_bit)
: find_next_zero_bit(bm, n_bits, current_bit);
if (tmp >= n_bits)
RL[i++] = n_bits - current_bit;
else
RL[i++] = tmp - current_bit;
if (i == RING_SIZE || tmp >= n_bits) {
write_RL(i);
if (tmp >= n_bits)
break;
i = 0;
}
current_bit = tmp;
}
close(1);
return 0;
}
/* plain-to-rl }}}1 */
/* rl-to-vli {{{2 */
/* drbd on-network packet is preceded by an 8 byte header, btw.
 * also, we transmit an incompressible chunk as plain, obviously.
 * left out here for demonstration of the encoding properties only. */
/* also used from vli_to_rl */
/* CAUTION: do not increase OUTPUT_BUFF_SIZE without also changing
* the chunk.head format, see head_to_bytes() below.
* in a proper implementation, you would add some magic or checksum,
* plus a flag for interleaved plaintext (for incompressible chunks). */
#define OUTPUT_BUFF_SIZE 4096
static struct code_chunk {
unsigned short head;
#define head_to_bytes(head) ((head & 0x0fff) +1)
#define head_to_pad_bits(head) ((head & 0x7000) >> 12)
#define head_is_first_bit_set(head) ((head & 0x8000) != 0)
unsigned char code[OUTPUT_BUFF_SIZE];
} chunk;
ssize_t pipe_read(int fd, void *buf, ssize_t l)
{
ssize_t c, count = 0;
int loop = 0;
do {
c = read(fd, buf + count, l);
if (c < 0) {
eprintf("error reading from stdin: %m\n");
exit(2);
}
if (c == 0 && ++loop > 3)
break;
count += c;
l -= c;
} while (l);
return count;
}
int read_RL()
{
int count = pipe_read(in_fd, RL, sizeof(RL));
if (count % sizeof(RL[0])) {
eprintf("short read, not modulo native unsigned int!\n"
" %u %% %u == %u\n", count, (unsigned)sizeof(RL[0]),
count % (unsigned)sizeof(RL[0]));
exit(2);
}
return count / sizeof(RL[0]);
}
/* one could save (8*sizeof(head) + up to 7) bits every chunk, by just
* streaming the code bits one after the other.
* but then you have no easy way to detect truncated code during decode */
void write_code_chunk_rewind_bs(struct bitstream *bs, int first_bit_set)
{
ssize_t c;
ssize_t s = bs->cur.b - bs->buf + !!bs->cur.bit;
chunk.head = first_bit_set ? 0x8000 : 0;
/* pad bits */
chunk.head |= (0x7 & (8 - bs->cur.bit)) << 12;
/* code bytes */
chunk.head |= 0x0fff & (s - 1);
/* should not happen? */
if (!s)
return;
s += sizeof(chunk.head);
c = write(out_fd, &chunk, s);
if (c < 0) {
eprintf("error writing code blob to stdout: %m\n");
exit(6);
}
if (c != s) {
eprintf("short write: %lu != %lu\n",
(unsigned long)c, (unsigned long)s);
exit(7);
}
bitstream_rewind(bs);
}
static struct {
unsigned n;
unsigned plain;
unsigned code;
} stats[2]; /* compressible, incompressible */
void do_stats(struct bitstream *bs, int n_chunks, unsigned long plain)
{
unsigned long code = ((bs->cur.b - bs->buf) + !!bs->cur.bit + 2) * 8;
/* eprintf("chunk:%u plain_bits:%lu code_bits:%u\n",
n_chunks, plain, code); */
++stats[code > plain].n;
stats[code > plain].plain += plain;
stats[code > plain].code += code;
}
void print_stat_summary(void)
{
unsigned total_n = stats[0].n + stats[1].n;
unsigned long total_plain_bits = stats[0].plain + stats[1].plain;
unsigned long total_code_bits = stats[0].code + stats[1].code;
eprintf("stats: %u chunks, %u compressed, %u uncompressed\n"
"\tonly compressible: %u plain bits -> %u code bits\n",
total_n, stats[0].n, stats[1].n,
stats[0].plain, stats[0].code);
eprintf("total saved: %.2f%%\n",
100.0 * (total_plain_bits - total_code_bits)
/ total_plain_bits);
}
int rl_to_vli(int argc, char **argv)
{
struct bitstream bs;
unsigned long plain_bits = 0;
int l = read_RL();
int first_is_set;
int odd_is_set = 1;
int i;
int bits;
int stats = NULL != getenv("RL_TO_VLI_STATS");
int n_chunks = 0;
if (l == 0) {
eprintf("empty input!\n");
return 3;
}
bitstream_init(&bs, chunk.code, OUTPUT_BUFF_SIZE, 0);
i = !RL[0];
first_is_set = i;
do {
while (i < l) {
/* paranoia: catch zero runlength.
* can only happen if bitmap was modified while it was scanned. */
if (RL[i] == 0) {
eprintf("unexpected zero runlength i=%d\n", i);
return 4;
}
redo:
bits = vli_encode_bits(&bs, RL[i]);
if (bits == -ENOBUFS) { /* buffer full */
if (stats) {
do_stats(&bs, n_chunks++, plain_bits);
plain_bits = 0;
}
write_code_chunk_rewind_bs(&bs, first_is_set);
/* current will be first of next packet */
first_is_set = (i & 1) == odd_is_set;
goto redo;
}
if (bits <= 0) {
eprintf("error while encoding runlength: %d\n", bits);
return 5;
}
if (stats)
plain_bits += RL[i];
i++;
}
if (l & 1)
odd_is_set = !odd_is_set;
l = read_RL();
i = 0;
} while (l);
if (bs.cur.b != bs.buf || bs.cur.bit)
write_code_chunk_rewind_bs(&bs, first_is_set);
if (stats)
print_stat_summary();
return 0;
}
/* rl-to-vli }}}1 */
/* vli-to-rl {{{2 */
ssize_t read_one_code_chunk(struct bitstream *bs)
{
static unsigned long total_read_bytes;
ssize_t len;
int bytes;
len = pipe_read(in_fd, &chunk.head, sizeof(chunk.head));
if (len == 0)
return 0;
if (len != 2) {
eprintf("short read reading chunk.head\n");
exit(2);
}
total_read_bytes += len;
bytes = head_to_bytes(chunk.head);
len = pipe_read(in_fd, &chunk.code, bytes);
if (len != bytes) {
eprintf("short read reading chunk.code: %d %d (%lu)\n",
(unsigned)len, bytes, total_read_bytes);
exit(2);
}
total_read_bytes += len;
bitstream_init(bs, chunk.code, bytes, head_to_pad_bits(chunk.head));
return len + 2;
}
int vli_to_rl(int argc, char **argv)
{
struct bitstream bs;
__u64 look_ahead;
__u64 tmp;
__u64 rl;
int toggle;
int have;
int bits;
int first = 1;
int i = 0;
while (read_one_code_chunk(&bs)) {
look_ahead = 0;
tmp = 0;
have = 0;
toggle = !head_is_first_bit_set(chunk.head);
if (first) {
first = 0;
if (!toggle)
RL[i++] = 0;
}
for (;;) {
toggle = !toggle;
/* get fresh bits */
bits = bitstream_get_bits(&bs, &tmp, 64 - have);
if (bits < 0)
return 3;
look_ahead |= tmp << have;
have += bits;
if (have == 0)
break;
/* consume one code number */
bits = vli_decode_bits(&rl, look_ahead);
if (bits <= 0)
return 4;
/* cannot possibly decode more bits than I had */
if (have < bits)
return 5;
look_ahead >>= bits;
have -= bits;
RL[i++] = rl;
if (i == RING_SIZE) {
write_RL(i);
i = 0;
}
}
}
write_RL(i);
return 0;
}
/* vli-to-rl }}}1 */
/* rl-to-plain {{{2 */
int rl_to_plain(int argc, char **argv)
{
/* yep, this is excessive, and only used here to quick'n'dirty
* code something that is streamable. */
static unsigned char zeros[4096];
static unsigned char FFFFs[4096];
unsigned char *out;
unsigned int rl;
int l;
int i;
int set = 0;
int minibuf_bits = 0;
unsigned char minibuf = 0;
/* no need to initialize zeros, static did that for us */
memset(FFFFs, 0xff, sizeof(FFFFs));
while ((l = read_RL())) {
for (i = 0; i < l; i++, set = !set) {
rl = RL[i];
if (minibuf_bits || rl < 8) {
if (set) /* set bits */
minibuf |= ((1 << (rl < 7 ? rl : 7)) -1)
<< minibuf_bits;
if (rl < 8 - minibuf_bits) {
minibuf_bits += rl;
continue;
}
if (write(out_fd, &minibuf, 1) != 1) {
eprintf("FIXME 1\n");
exit(2);
}
rl -= 8 - minibuf_bits;
minibuf = 0;
minibuf_bits = 0;
}
out = set ? FFFFs : zeros;
while (rl > 8) {
size_t c = rl/8; /* 8 bits per byte */
if (sizeof(FFFFs) < c)
c = sizeof(FFFFs);
if (write(out_fd, out, c) != c) {
eprintf("FIXME 2\n");
exit(2);
}
rl -= c * 8;
}
if (rl) {
if (set)
minibuf = (1 << rl) -1;
minibuf_bits = rl;
}
}
}
if (minibuf_bits)
if (write(out_fd, &minibuf, 1) != 1) {
eprintf("FIXME 3\n");
exit(2);
}
return 0;
}
/* rl_to_plain }}}1 */
int main(int argc, char **argv)
{
progname = strrchr(argv[0], '/');
if (!progname)
progname = argv[0];
else
progname++;
if (argc < 2)
show_usage_and_die();
subcmd = argv[1];
if (!strcmp(subcmd, "plain-to-rl"))
return plain_to_rl(argc-2, argv+2);
if (!strcmp(subcmd, "rl-to-vli"))
return rl_to_vli(argc-2, argv+2);
if (!strcmp(subcmd, "vli-to-rl"))
return vli_to_rl(argc-2, argv+2);
if (!strcmp(subcmd, "rl-to-plain"))
return rl_to_plain(argc-2, argv+2);
fprintf(stderr, "%s %s: unimplemented subcommand\n", progname, subcmd);
return 1;
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
2009-05-05 21:53 ` James Bottomley
@ 2009-05-06 8:17 ` Philipp Reisner
0 siblings, 0 replies; 44+ messages in thread
From: Philipp Reisner @ 2009-05-06 8:17 UTC (permalink / raw)
To: James Bottomley
Cc: david, Willy Tarreau, Bart Van Assche, Andrew Morton,
linux-kernel, Jens Axboe, Greg KH, Neil Brown, Sam Ravnborg,
Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree, Kyle Moffett,
Lars Ellenberg
[...]
>
> Well, you have to agree that during a resync from the activity log,
> which plays up the primary disk from one end to another, the secondary
> is completely corrupt if a primary failure occurs before the resync
> completes. That's something that's triggered by a network outage, and
> so is a far more common event than cascading dual failures. It's all
> really a question of where you focus your effort to eliminate the corner
> cases.
>
I fully agree. Just to not leave this unanswered: with DRBD we provide
a snapshot-resync-target handler. Using LVM's snapshotting
mechanism, a snapshot is taken before the node becomes resync target.
If the resync completes gracefully, the snapshot is automatically
removed.
Which is still inferior to a full transaction log on the secondary.
-Phil
* Re: [PATCH 00/16] DRBD: a block device for HA clusters
@ 2009-05-14 22:31 devzero
0 siblings, 0 replies; 44+ messages in thread
From: devzero @ 2009-05-14 22:31 UTC (permalink / raw)
To: bart.vanassche; +Cc: akpm, linux-kernel, philipp.reisner
>On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
><akpm@linux-foundation.org> wrote:
>> On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
>>
>>> This is a repost of DRBD
>>
>> Is it being used anywhere for anything? If so, where and what?
>
>One popular application is to run iSCSI and HA software on top of DRBD
>in order to build a highly available iSCSI storage target.
>
>Bart.
Iirc, Xtravirt's XVS Virtual SAN appliance is built around DRBD.
For those interested:
Some Blog entry:
http://vmetc.com/2008/05/23/xtravirt-xvs-creates-a-free-san-out-of-local-esx-vmfs/
Design sheet:
http://communities.vmware.com/servlet/JiveServlet/download/950436-9486/xvsrefdiag.jpg
Discussion:
http://communities.vmware.com/message/1114092#1114092
Seems to be sold to PHD Virtual Technologies now....
regards
roland
* [PATCH 00/16] DRBD: a block device for HA clusters
@ 2009-05-15 12:10 Philipp Reisner
0 siblings, 0 replies; 44+ messages in thread
From: Philipp Reisner @ 2009-05-15 12:10 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
Sam Ravnborg, Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
Lars Ellenberg, Philipp Reisner
Hi,
This is a repost of DRBD, to keep you updated about the ongoing
cleanups and improvements.
Patch set attached. Git tree available:
git pull git://git.drbd.org/linux-2.6-drbd.git drbd
We are looking for reviews!
Note for reviewers:
Only the first two patches (major.h and lru_cache) are self contained.
The other patches are just split at file boundaries. Sorry, DRBD
was developed as an out-of-tree module for just too long.
Short Description
DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability
clusters and in this context, is a "drop-in" replacement for shared
storage. Simplistically, you could see it as a network RAID 1.
More information can be found at http://www.drbd.org
Changes since 2009-04-30
* Cleanup: Removed typecasts, more documentation in lru_cache. Moved to /lib
* Cleanup: replaced __attribute__((packed)) with __packed
* Cleanup: remove quite a few 'inline's from .c files
* Cleanup: renaming a few constants: _SECT -> _SECTOR_SIZE, _SIZE_B -> _SHIFT ...
* Cleanup: rename inc_local -> get_ldev; inc_net -> get_net_conf; and corresponding dec_* -> put_*
* Cleanup: rename mdev->bc to mdev->ldev (to match the recent change to get_ldev/put_ldev)
* Cleanup: Made function comments kernel-doc compliant
* Cleanup: vmalloc() only as a fall back for kmalloc()
* DRBD: Allow detach of a SyncTarget node. (Bugz 221)
* DRBD: Call drbd_rs_cancel_all() and reset rs_pending when aborting resync due to detach. (Bugz 223)
* DRBD: make drbd thread t_lock irqsave - lockdep complained, and lockdep is right (theoretically)
Changes since 2009-04-10
* Cleanup: Removed all CamelCase
* Cleanup: Replaced DRBD's own tracing stuff with regular tracepoints
* Cleanup: Removed ERR/INFO/ALERT ... macros, using dev_err/dev_info/... now
* Cleanup: Minor stuff, as suggested in feedback on LKML
* DRBD: Bitmap compression feature was finalised
* DRBD: new disable_sendpage parameter
Changes since the post on 2009-03-30, all triggered by reviews
* Improvements to Makefile and Kconfig
* Simplified definitions of bm_flags' bitnumbers
* Removed debugging aid
Changes since the post on 2009-03-23, from drbd-mainline
* Updated to the final drbd-8.3.1 code
* Optionally run-length encode bitmap transfers
Changes since the post on 2009-03-23, triggered by reviews
* Using the latest proc_create() now
* Moved the allocation of md_io_tmpp to attach/detach out of drbd_md_sync_page_io()
* Removing the mode selection comments for emacs
* Removed DRBD_ratelimit()
cheers,
Phil
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 00/16] drbd: a block device for HA clusters
@ 2009-07-06 15:39 Philipp Reisner
2009-07-21 5:49 ` Andrew Morton
0 siblings, 1 reply; 44+ messages in thread
From: Philipp Reisner @ 2009-07-06 15:39 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
Sam Ravnborg, Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
Christoph Hellwig, drbd-dev, Lars Ellenberg, Philipp Reisner
Hi,
As the first bit of the DRBD patch already got upstream (see commit
10fc89d01a) it is time to get more of DRBD towards mainline.
Here is a post of drbd-8.3.2 for inclusion into linux-mm (or linux-next).
Patch set attached. Git tree available:
git pull git://git.drbd.org/linux-2.6-drbd.git drbd
In case you want to review the code, here is a note for you:
Only the first patch (lru_cache) is self contained. The other patches are
just split at file boundaries. Sorry, DRBD was developed as an
out-of-tree module for just too long.
Short Description
DRBD is a shared-nothing, replicated block device. It is designed to
serve as a building block for high availability clusters and
in this context, is a "drop-in" replacement for shared storage.
Simplistically, you could see it as a network RAID 1.
More information can be found at http://www.drbd.org
Changes since 2009-06-26
* Cleanup: Added an entry to the MAINTAINERS file
* DRBD: Now at drbd-8.3.2:
* DRBD: Fixed a hard to trigger race condition. (kmap_atomic(..., KM_IRQ1) interruptible)
Changes since 2009-05-15
* Cleanup: Moved lru_cache.c to /lib
* Cleanup: all STATIC -> static
* Cleanup: Removed drbd_config.h ; New Kconfig option: CONFIG_DRBD_FAULT_INJECTION
* Cleanup: Removed drbd_buildtag.c
* DRBD: Following DRBD-upstream, now at 8.3.2-rc2. Relevant changes:
* DRBD: lru_cache: use pointer arrays and kmem_cache
* DRBD: Fixed for building on big endian architectures
* DRBD: Fixed nl stuff to work on architectures that do not do unaligned memory accesses
* DRBD: Deal with hash functions already ported to SHASH
* DRBD: GFP_KERNEL -> GFP_NOIO in various places
Changes since 2009-04-30
* Cleanup: Removed typecasts, more documentation in lru_cache. Moved to /lib
* Cleanup: replaced __attribute__((packed)) with __packed
* Cleanup: remove quite a few 'inline's from .c files
* Cleanup: renaming a few constants: _SECT -> _SECTOR_SIZE, _SIZE_B -> _SHIFT ...
* Cleanup: rename inc_local -> get_ldev; inc_net -> get_net_conf; and corresponding dec_* -> put_*
* Cleanup: rename mdev->bc to mdev->ldev (to match the recent change to get_ldev/put_ldev)
* Cleanup: Made function comments kernel-doc compliant
* Cleanup: vmalloc() only as a fall back for kmalloc()
* DRBD: Allow detach of a SyncTarget node. (Bugz 221)
* DRBD: Call drbd_rs_cancel_all() and reset rs_pending when aborting resync due to detach. (Bugz 223)
* DRBD: make drbd thread t_lock irqsave - lockdep complained, and lockdep is right (theoretically)
Changes since 2009-04-10
* Cleanup: Removed all CamelCase
* Cleanup: Replaced DRBD's own tracing stuff with regular tracepoints
* Cleanup: Removed ERR/INFO/ALERT ... macros, using dev_err/dev_info/... now
* Cleanup: Minor stuff, as suggested in feedback on LKML
* DRBD: Bitmap compression feature was finalised
* DRBD: new disable_sendpage parameter
Changes since the post on 2009-03-30, all triggered by reviews
* Improvements to Makefile and Kconfig
* Simplified definitions of bm_flags' bitnumbers
* Removed debugging aid
Changes since the post on 2009-03-23, from drbd-mainline
* Updated to the final drbd-8.3.1 code
* Optionally run-length encode bitmap transfers
Changes since the post on 2009-03-23, triggered by reviews
* Using the latest proc_create() now
* Moved the allocation of md_io_tmpp to attach/detach out of drbd_md_sync_page_io()
* Removing the mode selection comments for emacs
* Removed DRBD_ratelimit()
cheers,
Phil
* Re: [PATCH 00/16] drbd: a block device for HA clusters
2009-07-06 15:39 [PATCH 00/16] drbd: " Philipp Reisner
@ 2009-07-21 5:49 ` Andrew Morton
0 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2009-07-21 5:49 UTC (permalink / raw)
To: Philipp Reisner
Cc: linux-kernel, Jens Axboe, Greg KH, Neil Brown, James Bottomley,
Sam Ravnborg, Dave Jones, Nikanth Karthikesan, Lars Marowsky-Bree,
Nicholas A. Bellinger, Kyle Moffett, Bart Van Assche,
Christoph Hellwig, drbd-dev, Lars Ellenberg, linux-next
On Mon, 6 Jul 2009 17:39:19 +0200 Philipp Reisner <philipp.reisner@linbit.com> wrote:
> As the first bit of the DBRD patch already got upstream (see commit
> 10fc89d01a) it is time to get more of DRBD towards mainline.
>
> Here is a post of drbd-8.3.2 for inclusion into linux-mm (or linux-next).
>
> Patch set attached. Git tree available:
> git pull git://git.drbd.org/linux-2.6-drbd.git drbd
I don't think I can be bothered reading all this again ;) I trust that
earlier review comments were suitably addressed.
Please prepare a tree for inclusion in linux-next, send that off to
Stephen and unless someone can identify reasons otherwise, send Linus a pull
request for 2.6.32-rc1.
end of thread, other threads:[~2009-07-21 5:51 UTC | newest]
Thread overview: 44+ messages
2009-05-14 22:31 [PATCH 00/16] DRBD: a block device for HA clusters devzero
-- strict thread matches above, loose matches on Subject: below --
2009-07-06 15:39 [PATCH 00/16] drbd: " Philipp Reisner
2009-07-21 5:49 ` Andrew Morton
2009-05-15 12:10 [PATCH 00/16] DRBD: " Philipp Reisner
2009-04-30 11:26 Philipp Reisner
2009-05-01 8:59 ` Andrew Morton
2009-05-01 11:15 ` Lars Marowsky-Bree
2009-05-01 13:14 ` Dave Jones
2009-05-01 19:14 ` Andrew Morton
2009-05-05 4:05 ` Christian Kujau
2009-05-02 7:33 ` Bart Van Assche
2009-05-03 5:36 ` Willy Tarreau
2009-05-03 5:40 ` david
2009-05-03 14:21 ` James Bottomley
2009-05-03 14:36 ` david
2009-05-03 14:45 ` James Bottomley
2009-05-03 14:56 ` david
2009-05-03 15:09 ` James Bottomley
2009-05-03 15:22 ` david
2009-05-03 15:38 ` James Bottomley
2009-05-03 15:48 ` david
2009-05-03 16:02 ` James Bottomley
2009-05-03 16:13 ` david
2009-05-04 8:28 ` Philipp Reisner
2009-05-04 17:24 ` James Bottomley
2009-05-05 8:21 ` Philipp Reisner
2009-05-05 14:09 ` James Bottomley
2009-05-05 15:56 ` Philipp Reisner
2009-05-05 17:05 ` James Bottomley
2009-05-05 21:45 ` Philipp Reisner
2009-05-05 21:53 ` James Bottomley
2009-05-06 8:17 ` Philipp Reisner
2009-05-05 15:03 ` Bart Van Assche
2009-05-05 15:57 ` Philipp Reisner
2009-05-05 17:38 ` Lars Marowsky-Bree
2009-05-03 10:06 ` Philipp Reisner
2009-05-03 10:15 ` Thomas Backlund
2009-05-03 5:53 ` Neil Brown
2009-05-03 6:24 ` david
2009-05-03 8:29 ` Lars Ellenberg
2009-05-03 11:00 ` Neil Brown
2009-05-03 21:32 ` Lars Ellenberg
2009-05-04 16:12 ` Lars Marowsky-Bree
2009-05-05 22:08 ` Lars Ellenberg