public inbox for linux-kernel@vger.kernel.org
From: David Teigland <teigland@redhat.com>
To: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: linux-kernel@vger.kernel.org, NeilBrown <neilb@suse.de>
Subject: Re: clustered MD
Date: Wed, 10 Jun 2015 12:05:33 -0500	[thread overview]
Message-ID: <20150610170533.GD333@redhat.com> (raw)
In-Reply-To: <5578647D.102@suse.com>

On Wed, Jun 10, 2015 at 11:23:25AM -0500, Goldwyn Rodrigues wrote:
> To start with, the goal of (basic) MD RAID1 is to keep the two
> mirrored devices consistent _all_ of the time. In case of a device
> failure, it should degrade the array, marking the failed device so
> it can be (hot)removed/replaced. Now, take the same concepts to
> multiple nodes using the same MD-RAID1 device..

"multiple nodes using the same MD-RAID1 device" concurrently!?  That's a
crucial piece of information that really frames the entire topic.  It needs
to be your very first point when defining the purpose of this work.

How would you use the same MD-RAID1 device concurrently on multiple nodes
without a cluster file system?  Does this imply that your work is only
useful for the tiny segment of people who could use MD-RAID1 under a
cluster file system?  There was a previous implementation of this in user
space called "cmirror", built on dm, which turned out to be quite useless,
and is being deprecated.  Did you talk to cluster file system developers
and users to find out if this is worth doing?  Or are you just hoping it
turns out to be worthwhile?  That might be answered by the examples of
successful real-world usage that I asked about.  We don't want to be tied
down with long-term maintenance of something that isn't worth it.


> >What's different about disks being on SAN that breaks data consistency vs
> >disks being locally attached?  Where did the dlm come into the picture?
> 
> There are multiple nodes using the same shared device. Different
> nodes would be writing their own data to the shared device possibly
> using a shared filesystem such as ocfs2 on top of it. Each node
> maintains a bitmap to co-ordinate syncs between the two devices of
> the RAID. Since there are two devices, writes on the two devices can
> end at different times and must be co-ordinated.

Thank you, this is the kind of technical detail that I'm looking for.
Separate bitmaps for each node sounds like a much better design than the
cmirror design which used a single shared bitmap (I argued for using a
single bitmap when cmirror was being designed.)

Given that the cluster file system does locking to prevent concurrent
writes to the same blocks, you shouldn't need any locking in raid1 for
that.  Could you elaborate on exactly when inter-node locking is needed,
i.e. what specific steps need to be coordinated?


> >>Device failure can be partial. Say, only node 1 sees that one of the
> >>devices has failed (link break).  You need to "tell" other nodes not
> >>to use the device and that the array is degraded.
> >
> >Why?
> 
> Data consistency. Because the node which continues to "see" the
> failed device (on another node) as working will read stale data.

I still don't understand, but I suspect this will become clear from other
examples.


> Different nodes will be writing to different
> blocks. So, if a node fails, you need to make sure that whatever it
> had not yet synced between the two devices is completed by the
> node performing recovery. You need to provide a consistent view
> to all nodes.

This is getting closer to the kind of detail we need, but it's not quite
there yet.  I think a full-blown example is probably required, e.g. in
terms of specific reads and writes:

1. node1 writes to block X
2. node2 ...


> Also, may I point you to linux/Documentation/md-cluster.txt?

That looks like it will be very helpful when I get to the point of
reviewing the implementation.



Thread overview: 23+ messages
2015-06-09 18:22 clustered MD David Teigland
2015-06-09 19:26 ` Goldwyn Rodrigues
2015-06-09 19:45   ` David Teigland
2015-06-09 20:08     ` Goldwyn Rodrigues
2015-06-09 20:30       ` David Teigland
2015-06-09 20:33         ` David Lang
2015-06-10  3:33         ` Goldwyn Rodrigues
2015-06-10  8:00           ` Richard Weinberger
2015-06-10 13:59             ` Goldwyn Rodrigues
2015-06-10 15:01           ` David Teigland
2015-06-10 15:27             ` Goldwyn Rodrigues
2015-06-10 15:48               ` David Teigland
2015-06-10 16:23                 ` Goldwyn Rodrigues
2015-06-10 17:05                   ` David Teigland [this message]
2015-06-10 19:22                     ` David Teigland
2015-06-10 20:31             ` Neil Brown
2015-06-10 21:07               ` David Teigland
2015-06-10 22:11                 ` David Teigland
2015-06-10 22:50                 ` Neil Brown
2015-06-12 18:46                   ` David Teigland
2015-06-14 22:19                     ` Goldwyn Rodrigues
2015-06-23  1:34                       ` NeilBrown
2015-06-09 20:14     ` Goldwyn Rodrigues
