From: Daniel Phillips <phillips@istop.com>
To: Lars Marowsky-Bree <lmb@suse.de>
Cc: linux-kernel@vger.kernel.org, Patrick Caulfield <pcaulfie@redhat.com>
Subject: Re: [PATCH 0/7] dlm: overview
Date: Thu, 28 Apr 2005 16:53:20 -0400 [thread overview]
Message-ID: <200504281653.21060.phillips@istop.com> (raw)
In-Reply-To: <20050428145715.GA21645@marowsky-bree.de>
On Thursday 28 April 2005 10:57, Lars Marowsky-Bree wrote:
> On 2005-04-27T18:38:18, Daniel Phillips <phillips@istop.com> wrote:
> > Uuids's at this level are inherently bogus, unless of course you have
> > more than 2**32 cluster nodes. I don't know about you, but I do not have
> > even half that many nodes over here.
>
> This is not quite the argument. With that argument, 16 bit would be
> fine. And even then, I'd call you guilty of causing my lights to flicker
> ;-)
BlueGene is pushing the 16 bit node number boundary already, 32 bits seems
prudent. More is silly. Think of the node number as more like a PID than a
UUID.
> The argument about UUIDs goes a bit beyond that: No admin needed to
> assign them; they can stay the same even if clusters/clusters merge (in
> theory); they can be used for inter-cluster addressing too, because they
> aren't just unique within a single cluster (think clusters of clusters,
> grids etc, whatever the topology), and finally, UUID is a big enough
> blob to put all other identifiers in, be it a two bit node id, a
> nodename, 32bit IPv4 address or a 128bit IPv6.
>
> This piece is important. It defines one of the fundamental objects in
> the API.
>
> I recommend you read up on the discussions on the OCF list on this; this
> has probably been one of the hottest arguments.
Add a translation layer if you like, and submit it in the form of a user space
library service. Or have it be part of your own layer or application. There
is no compelling argument for embedding such a bloatlet to cman proper (which
is already fatter than it should be).
> "How is the kernel component
> configured which paths/IP to use" - ie, the equivalent of ifconfig/route
> for the cluster stack,
There is a config file in /etc and a (userspace) scheme for distributing the
file around the cluster (ccs - cluster configuration system).
How the configuration gets from the config file to kernel is a mystery to me
at the moment, which I will hopefully solve by reading some code later
today ;-)
> Doing this in a wrapper is one answer - in which case we'd have a
> consistent user-space API provided by shared libraries wrapping a
> specific kernel component. This places the boundary in user-space.
I believe that it is almost entirely in user space now, with the recent move
of cman to user space. I have not yet seen the new code, so I don't know the
details (this egg was hatched by Dave and Patrick).
> This seems to be a main point of contention, also applicable to the
> first question about node identifiers: What does the kernel/user-space
> boundary look like, and is this the one we are aiming to clarify?
Very much so.
> Or do we place the boundary in user-space with a specific wrapper around
> a given kernel solution.
Yes. But let's try and have a good, durable kernel solution right from the
start.
> > Since cman has now moved to user space, userspace does not tell the
> > kernel about membership,
>
> That partial sentence already makes no sense.
Partial?
> So how does the kernel
> (DLM in this case) learn about whether a node is assumed to be up or
> down if the membership is in user-space? Right! User-space must tell
> it.
By a message over a socket, as I said. This is a really nice property of
sockets: when cman moved from kernel to user space, (g)dlm was hardly
affected at all.
> For example, with OCFS2 (w/user-space membership, which it doesn't yet
> have, but which they keep telling me is trivial to add, but we're
> busying them with other work right now ;-) it is supposed to work like
> this: When a membership event occurs, user-space transfers this event
> to the kernel by writing to a configfs mount.
Let me go get my airsick bag right now :-)
Let's have no magical filesystems in the core interface please. We can always
add some later on top of a sane base interface, assuming somebody has too
much time on their hands, and that Andrew was busy doing something else, and
Linus left his taste at home that day.
> Likewise, the node ids and comm links the kernel DLM uses with OCFS2
> are configured via that interface.
I am looking forward to flaming that interface should it dare to rear its ugly
head here :-)
> If we could standarize at the kernel/user-space boundary for clustering,
> like we do for syscalls, this would IMHO be cleaner than having
> user-space wrappers.
I don't see anything wrong with wrapping a sane kernel interface with more
stuff to make it more convenient. Right now, the interface is a socket and a
set of messages for the socket. Pretty elegant, if you ask me.
There are bones to pick at the message syntax level of course.
> > Can we have a list of all the reasons that you cannot wrap your heartbeat
> > interface around cman, please?
>
> Any API can be mapped to any other API.
I meant sanely. Let me try again: can we have a list of all the reasons that
you cannot wrap your heartbeat interface around cman _sanely_, please.
> That wasn't the question. I was aiming at the kernel/user-space boundary
> again.
Me too.
> > > ... which is why I asked the above questions: User-space needs to
> > > interface with the kernel to tell it the membership (if the membership
> > > is user-space driven), or retrieve it (if it is kernel driven).
> >
> > Passing things around via sockets is a powerful model.
>
> Passing a socket in to use for communication makes sense. "I want you to
> use this transport when talking to the cluster". However, that begs the
> question whether you're passing in a unicast peer-to-peer socket or a
> multicast one which reaches all of the nodes,
It was multicast last time I looked. I heard mumblings about changing from a
udp-derived protocol to a sctp-derived one, and I do not know if multicast
got lost in the translation. It would be a shame if it did. Patrick?
> and what kind of
> security, ordering, and reliability guarantees that transport needs to
> provide.
Security is practically nonexistent at the moment, we just keep normal users
away from the socket. Ordering is provided by a barrier facility at a higher
level. Delivery is guaranteed and knows about membership changes.
> > Of course, we could always read the patches...
>
> Reading patches is fine for understanding syntax, and spotting some
> nits. I find actual discussion with the developers to be invaluable to
> figure out the semantics and the _intention_ of the code, which takes
> much longer to deduce from the code alone; and you know, just sometimes
> the code doesn't actually reflect the intentions of the programmers who
> wrote it ;-)
Strongly agreed, and this thread is doing very well in that regard. But we
really, really, need people to read the patches as well, especially people
with a strong background in clustering.
Regards,
Daniel
next prev parent reply other threads:[~2005-04-28 20:53 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-04-25 15:11 [PATCH 0/7] dlm: overview David Teigland
2005-04-25 20:39 ` Wim Coekaerts
2005-04-25 21:09 ` Lars Marowsky-Bree
2005-04-26 5:30 ` Daniel Phillips
2005-04-27 13:56 ` Lars Marowsky-Bree
2005-04-27 20:00 ` Daniel Phillips
2005-04-27 20:20 ` Lars Marowsky-Bree
2005-04-27 22:38 ` Daniel Phillips
2005-04-28 14:57 ` Lars Marowsky-Bree
2005-04-28 20:53 ` Daniel Phillips [this message]
2005-04-29 0:33 ` David Lang
2005-04-29 1:49 ` Bernd Eckenfels
2005-04-29 1:52 ` Daniel Phillips
2005-04-29 17:13 ` David Lang
2005-04-29 20:49 ` Daniel Phillips
2005-05-01 3:57 ` Theodore Ts'o
2005-05-01 4:14 ` David Lang
2005-05-02 11:21 ` Lars Marowsky-Bree
2005-04-28 16:25 ` David Teigland
2005-04-28 16:42 ` Lars Marowsky-Bree
2005-04-29 4:24 ` Daniel Phillips
2005-04-25 21:19 ` Andrew Morton
2005-04-26 5:46 ` David Teigland
2005-04-26 5:39 ` David Teigland
2005-04-26 18:48 ` Mark Fasheh
2005-04-26 22:34 ` Steven Dake
2005-04-27 3:32 ` David Teigland
2005-04-27 13:23 ` Lars Marowsky-Bree
2005-04-27 18:12 ` Mark Fasheh
2005-04-28 14:36 ` Lars Marowsky-Bree
2005-04-28 17:35 ` Mark Fasheh
2005-04-28 12:50 ` Stephen C. Tweedie
2005-04-25 20:52 ` Daniel Phillips
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200504281653.21060.phillips@istop.com \
--to=phillips@istop.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lmb@suse.de \
--cc=pcaulfie@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.