From: Daniel Phillips <phillips@istop.com>
To: Lars Marowsky-Bree <lmb@suse.de>
Cc: linux-kernel@vger.kernel.org, Patrick Caulfield <pcaulfie@redhat.com>
Subject: Re: [PATCH 0/7] dlm: overview
Date: Thu, 28 Apr 2005 16:53:20 -0400
Message-ID: <200504281653.21060.phillips@istop.com>
In-Reply-To: <20050428145715.GA21645@marowsky-bree.de>

On Thursday 28 April 2005 10:57, Lars Marowsky-Bree wrote:
> On 2005-04-27T18:38:18, Daniel Phillips <phillips@istop.com> wrote:
> > UUIDs at this level are inherently bogus, unless of course you have
> > more than 2**32 cluster nodes.  I don't know about you, but I do not have
> > even half that many nodes over here.
>
> This is not quite the argument. With that argument, 16 bit would be
> fine. And even then, I'd call you guilty of causing my lights to flicker
> ;-)

BlueGene is already pushing the 16-bit node number boundary, so 32 bits seems
prudent.  More is silly.  Think of the node number as more like a PID than a
UUID.

> The argument about UUIDs goes a bit beyond that: No admin needed to
> assign them; they can stay the same even if clusters/clusters merge (in
> theory); they can be used for inter-cluster addressing too, because they
> aren't just unique within a single cluster (think clusters of clusters,
> grids etc, whatever the topology), and finally, UUID is a big enough
> blob to put all other identifiers in, be it a two bit node id, a
> nodename, 32bit IPv4 address or a 128bit IPv6.
>
> This piece is important. It defines one of the fundamental objects in
> the API.
>
> I recommend you read up on the discussions on the OCF list on this; this
> has probably been one of the hottest arguments.

Add a translation layer if you like, and submit it in the form of a user space
library service.  Or have it be part of your own layer or application.  There
is no compelling argument for embedding such a bloatlet in cman proper (which
is already fatter than it should be).
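
As a rough illustration of such a translation layer, here is a minimal
user-space sketch in Python. It assumes the layer only needs to hand out a
small node number the first time it sees a UUID and to map back again. The
class NodeIdMap, its method names, and the choice to reserve 0 as "no node"
are all hypothetical; none of this is taken from cman or any existing library.

```python
import uuid


class NodeIdMap:
    """Hypothetical user-space table mapping UUIDs to 32-bit node numbers."""

    MAX_NODE_ID = 0xFFFFFFFF  # largest value that fits in 32 bits

    def __init__(self):
        self._by_uuid = {}
        self._by_id = {}
        self._next_id = 1  # assumption: 0 is reserved to mean "no node"

    def node_id(self, node_uuid):
        """Return the node number for a UUID, assigning one on first sight."""
        if node_uuid not in self._by_uuid:
            if self._next_id > self.MAX_NODE_ID:
                raise OverflowError("out of 32-bit node numbers")
            self._by_uuid[node_uuid] = self._next_id
            self._by_id[self._next_id] = node_uuid
            self._next_id += 1
        return self._by_uuid[node_uuid]

    def node_uuid(self, node_id):
        """Return the UUID previously registered for a node number."""
        return self._by_id[node_id]
```

Anything that wants UUIDs keeps them on its side of this table, and only the
small number crosses into the lower layer.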

> "How is the kernel component
> configured which paths/IP to use" - ie, the equivalent of ifconfig/route
> for the cluster stack,

There is a config file in /etc and a (userspace) scheme for distributing the 
file around the cluster (ccs - cluster configuration system).

How the configuration gets from the config file to the kernel is a mystery to
me at the moment, which I will hopefully solve by reading some code later
today ;-)

> Doing this in a wrapper is one answer - in which case we'd have a
> consistent user-space API provided by shared libraries wrapping a
> specific kernel component. This places the boundary in user-space.

I believe that it is almost entirely in user space now, with the recent move 
of cman to user space.  I have not yet seen the new code, so I don't know the 
details (this egg was hatched by Dave and Patrick).

> This seems to be a main point of contention, also applicable to the
> first question about node identifiers: What does the kernel/user-space
> boundary look like, and is this the one we are aiming to clarify?

Very much so.

> Or do we place the boundary in user-space with a specific wrapper around
> a given kernel solution.

Yes.  But let's try and have a good, durable kernel solution right from the 
start.

> > Since cman has now moved to user space, userspace does not tell the
> > kernel about membership,
>
> That partial sentence already makes no sense.

Partial?

> So how does the kernel 
> (DLM in this case) learn about whether a node is assumed to be up or
> down if the membership is in user-space? Right! User-space must tell
> it.

By a message over a socket, as I said.  This is a really nice property of 
sockets: when cman moved from kernel to user space, (g)dlm was hardly 
affected at all.
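
To make the socket model concrete, here is a minimal Python sketch of
membership events travelling over a socket. It assumes a made-up fixed-size
message of two 32-bit fields (event type, node number) in network byte order.
The format, the NODE_UP/NODE_DOWN constants, and the function names are
hypothetical; the real cman/dlm message syntax is not shown in this thread. A
local socketpair stands in for the real channel.

```python
import socket
import struct

EVENT_FMT = "!II"          # hypothetical layout: event type, node number
NODE_UP, NODE_DOWN = 1, 2  # hypothetical event codes


def send_event(sock, event, node_id):
    """Membership side: write one fixed-size event to the socket."""
    sock.sendall(struct.pack(EVENT_FMT, event, node_id))


def recv_event(sock):
    """Consumer side: read exactly one event, handling short reads."""
    size = struct.calcsize(EVENT_FMT)
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        data += chunk
    return struct.unpack(EVENT_FMT, data)


def demo():
    """Send three events and return the member set the consumer ends up with."""
    membership, lock_manager = socket.socketpair()
    members = set()
    try:
        send_event(membership, NODE_UP, 3)
        send_event(membership, NODE_UP, 7)
        send_event(membership, NODE_DOWN, 3)
        for _ in range(3):
            event, node_id = recv_event(lock_manager)
            if event == NODE_UP:
                members.add(node_id)
            else:
                members.discard(node_id)
    finally:
        membership.close()
        lock_manager.close()
    return members
```

The consumer only sees messages on a socket, so it does not care whether the
sender lives in the kernel or in user space.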

> For example, with OCFS2 (w/user-space membership, which it doesn't yet
> have, but which they keep telling me is trivial to add, but we're
> busying them with other work right now ;-) it is supposed to work like
> this: When a membership event occurs, user-space transfers this event
> to the kernel by writing to a configfs mount.

Let me go get my airsick bag right now :-)

Let's have no magical filesystems in the core interface please.  We can always 
add some later on top of a sane base interface, assuming somebody has too 
much time on their hands, and that Andrew was busy doing something else, and 
Linus left his taste at home that day.

> Likewise, the node ids and comm links the kernel DLM uses with OCFS2
> are configured via that interface.

I am looking forward to flaming that interface should it dare to rear its ugly 
head here :-)

> If we could standardize at the kernel/user-space boundary for clustering,
> like we do for syscalls, this would IMHO be cleaner than having
> user-space wrappers.

I don't see anything wrong with wrapping a sane kernel interface with more 
stuff to make it more convenient.  Right now, the interface is a socket and a 
set of messages for the socket.   Pretty elegant, if you ask me.

There are bones to pick at the message syntax level of course.

> > Can we have a list of all the reasons that you cannot wrap your heartbeat
> > interface around cman, please?
>
> Any API can be mapped to any other API.

I meant sanely.  Let me try again: can we have a list of all the reasons that 
you cannot wrap your heartbeat interface around cman _sanely_, please.

> That wasn't the question. I was aiming at the kernel/user-space boundary
> again. 

Me too.

> > > ... which is why I asked the above questions: User-space needs to
> > > interface with the kernel to tell it the membership (if the membership
> > > is user-space driven), or retrieve it (if it is kernel driven).
> >
> > Passing things around via sockets is a powerful model.
>
> Passing a socket in to use for communication makes sense. "I want you to
> use this transport when talking to the cluster". However, that begs the
> question whether you're passing in a unicast peer-to-peer socket or a
> multicast one which reaches all of the nodes,

It was multicast last time I looked.  I heard mumblings about changing from a
UDP-derived protocol to an SCTP-derived one, and I do not know if multicast
got lost in the translation.  It would be a shame if it did.  Patrick?

> and what kind of 
> security, ordering, and reliability guarantees that transport needs to
> provide.

Security is practically nonexistent at the moment; we just keep normal users
away from the socket.  Ordering is provided by a barrier facility at a higher
level.  Delivery is guaranteed and knows about membership changes.

> > Of course, we could always read the patches...
>
> Reading patches is fine for understanding syntax, and spotting some
> nits. I find actual discussion with the developers to be invaluable to
> figure out the semantics and the _intention_ of the code, which takes
> much longer to deduce from the code alone; and you know, just sometimes
> the code doesn't actually reflect the intentions of the programmers who
> wrote it ;-)

Strongly agreed, and this thread is doing very well in that regard.  But we 
really, really, need people to read the patches as well, especially people 
with a strong background in clustering.

Regards,

Daniel
