* [Ocfs2-devel] [RFC] Integration with external clustering
@ 2005-10-18 16:52 Jeff Mahoney
  2005-10-18 17:18 ` Joel Becker
  2005-10-28 10:11 ` [Ocfs2-devel] " Lars Marowsky-Bree
  0 siblings, 2 replies; 31+ messages in thread
From: Jeff Mahoney @ 2005-10-18 16:52 UTC (permalink / raw)
  To: ocfs2-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hey all -

We're interested in using OCFS2 with an external, userspace clustering
solution. Specifically, the heartbeat2 project from linux-ha.org.
Obviously, the internal cluster manager would still be available for
users with no interest in deploying and configuring a full cluster
manager just to use the file system.  I'd like to attempt to make the
interface as consistent as possible between the two.

The obvious mapping to an external cluster manager is to map one file
system to one cluster resource, to be managed individually. The user
space cluster manager will take over most of the cluster management
infrastructure supplied now by o2cb, including heartbeat, fencing, etc.
The node manager would still be used to coordinate DLM operations, which
would be left in-kernel. The o2cb code is pretty well structured for
this kind of integration without a lot of hacking, but there are a few
sticking points. The good news is that the infrastructure for fixing
most of them is already in place, just waiting to be used.

The existing code has a notion of one global cluster with each node
owning a particular node number and a single IP address/port. This node
number is mapped 1:1 to file system slots and DLM domain node numbers,
regardless of how many nodes are actually involved in mounting any
particular file system. A large cluster may deploy a cluster-global file
system, but also many smaller file systems to small subsets of nodes.
The smaller file systems, even though they are deployed on a small
number of nodes, still require slots for every member of the larger
cluster. If separate network connectivity is desired for the smaller
file systems, separate node numbers must be allocated in order to
utilize the alternate network, making the problem worse.
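
To put rough numbers on that overhead, here is a minimal Python sketch. The
cluster size and node sets are invented for illustration, and the two helper
functions are hypothetical, not anything in the OCFS2 tree.

```python
def slots_with_global_numbering(cluster_size, mounting_nodes):
    # Node numbers map 1:1 to slots, so every cluster member needs a slot
    # even if it never mounts this file system.
    return cluster_size


def slots_with_per_fs_numbering(mounting_nodes):
    # Each file system numbers only the nodes that actually mount it.
    return len(mounting_nodes)


cluster_size = 32                 # assumed size of the large cluster
small_fs_nodes = {3, 17, 29}      # assumed nodes mounting a small file system

wasted = (slots_with_global_numbering(cluster_size, small_fs_nodes)
          - slots_with_per_fs_numbering(small_fs_nodes))
print(wasted)  # 29
```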

The one-cluster notion appears to be rooted in o2net, where the
assumption of a 1:1 IP Address:Node mapping is made. The node manager is
aware of multiple clusters, and even has to provide an interface to fake
the single cluster membership. o2net itself even understands that an
internode connection will be used for multiple virtual connections.

And, one of the larger issues for integration with a userspace cluster
manager is how nodes are organized and exported to userspace. Currently,
there is only one instance of a node. If a heartbeat down event is
triggered for a particular node, all file systems are told about it,
even if they don't care. What we need to integrate a userspace cluster
manager is more fine grained configuration of node membership.

I'd like to address these issues in my proposal:

Individual file systems should be represented individually, with
resources and connectivity assignable independently to each.

I'll start with an idea of what I'd like to see the configfs space look
like, since I think it will probably illustrate it best:

/config/cluster/ocfs2/<fs uuid>/<node>/
                                       ip address
                                       port
                                       fs slot
                                       local
                                       active (for userspace)
                                       heartbeat/ (for kernelspace)
                                                 block_bytes
                                                 blocks
                                                 dev
                                                 start_block

Rather than having one global cluster, each file system would be its own
cluster. Nodes would be created and destroyed as needed on a per file
system basis. The current o2net concept of a node would be replaced by
something that is specific to connectivity. The current implementation of
one connection per ip/port would stay, but rather than assume a
particular connection-node mapping at accept time, it would broker
messages later once the key has been observed in the message.
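
A rough Python model of that late brokering may help. It assumes one shared
connection per ip/port and resolves the sender's per-file-system node number
only once the key is seen. `Connection`, `Broker`, and the key strings are
hypothetical names for illustration, not o2net code.

```python
from collections import defaultdict


class Connection:
    """One shared transport per remote ip/port (illustrative only)."""

    def __init__(self, peer):
        self.peer = peer  # (ip, port) tuple


class Broker:
    def __init__(self):
        # key (e.g. derived from the fs uuid) -> {peer: node number in that fs}
        self.members = defaultdict(dict)
        self.delivered = []

    def add_node(self, key, peer, node_num):
        self.members[key][peer] = node_num

    def dispatch(self, conn, key, payload):
        # No node mapping is assumed at accept time; it is looked up here,
        # after the key has been read from the message.
        node_num = self.members.get(key, {}).get(conn.peer)
        if node_num is None:
            return False  # sender is not a member of this file system
        self.delivered.append((key, node_num, payload))
        return True
```

With this shape, the same peer can be node 0 in one file system and node 5 in
another while still using a single connection.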

Since heartbeat and node management would end up having similar trees
with different attributes, the node and heartbeat attributes would be
unified under a single fs instance.
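
As a sketch of the unified tree, the following Python helper lists the
attribute paths a tool would expect under one node directory. The underscored
attribute names (`ip_address`, `fs_slot`) and the `node_paths` function are
assumptions made for the example; the proposal above only names the attributes
informally.

```python
from pathlib import PurePosixPath

NODE_ATTRS = ["ip_address", "port", "fs_slot", "local", "active"]
HEARTBEAT_ATTRS = ["block_bytes", "blocks", "dev", "start_block"]


def node_paths(fs_uuid, node, userspace=True, root="/config/cluster/ocfs2"):
    base = PurePosixPath(root) / fs_uuid / node
    # "active" is only meaningful when userspace drives membership.
    paths = [base / a for a in NODE_ATTRS if userspace or a != "active"]
    if not userspace:
        # The heartbeat/ subtree is only used by the in-kernel heartbeat.
        paths += [base / "heartbeat" / a for a in HEARTBEAT_ATTRS]
    return [str(p) for p in paths]
```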

Obviously, modifications to the o2cb userspace tools would be required
to make this work. I think that the changes required for cluster.conf
could be minimal -- just keep the existing format and add overrides for
file systems that want to use different slots/networks/etc.
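
One way to picture the override idea, as a Python sketch: global node settings
act as defaults and a file system may override individual fields. The
dictionary layout and the `effective_config` helper are hypothetical and do not
reflect the actual cluster.conf syntax.

```python
def effective_config(global_nodes, overrides, fs_uuid):
    """Return per-node settings for one fs: global defaults plus overrides."""
    fs_overrides = overrides.get(fs_uuid, {})
    result = {}
    for name, settings in global_nodes.items():
        merged = dict(settings)                    # start from the defaults
        merged.update(fs_overrides.get(name, {}))  # apply fs-specific fields
        result[name] = merged
    return result


global_nodes = {
    "node1": {"ip": "192.168.0.1", "port": 7777, "slot": 0},
    "node2": {"ip": "192.168.0.2", "port": 7777, "slot": 1},
}
# Assumed example: one file system uses an alternate network for node2.
overrides = {"fs-B": {"node2": {"ip": "10.0.0.2"}}}
```

A file system with no entry in `overrides` simply gets the global settings, so
existing configurations would keep working unchanged.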

I'm volunteering to code all this up, I just didn't want to post code
that nobody wanted.

Opinions?

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDVW+KLPWxlyuTD7IRAv5SAJ4yUID/gnGslfhu0JZzNiF+1f0OYQCfUQei
2eeyWWd6lfe9Ae8NzV8tXSI=
=xI1V
-----END PGP SIGNATURE-----


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 16:52 [Ocfs2-devel] [RFC] Integration with external clustering Jeff Mahoney
@ 2005-10-18 17:18 ` Joel Becker
  2005-10-18 18:03   ` Lars Marowsky-Bree
  2005-10-18 18:20   ` Jeff Mahoney
  2005-10-28 10:11 ` [Ocfs2-devel] " Lars Marowsky-Bree
  1 sibling, 2 replies; 31+ messages in thread
From: Joel Becker @ 2005-10-18 17:18 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Oct 18, 2005 at 05:56:27PM -0400, Jeff Mahoney wrote:
> I'll start with an idea of what I'd like to see the configfs space look
> like, since I think it will probably illustrate it best:
> 
> /config/cluster/ocfs2/<fs uuid>/<node>/

	If you are treating each mount as a 'cluster', the ocfs2 path
element is pretty redundant, and /config/cluster/<fs uuid> would
suffice.
	Given that heartbeat regions can and should be shared, you need
a way to describe this.  We don't have userspace doing global heartbeat
yet, but there is no reason that all OCFS2 volumes can't share one
heartbeat region (see
http://oss.oracle.com/projects/ocfs2-tools/src/branches/global-heartbeat/documentation/o2cb/).
	Have you also considered what this will or won't do to possible
interaction with the CMan stack?  We'd love OCFS2 to handle both stacks.
	Finally, have you considered the user barriers to this?  The
absolute bottom-line goal of O2CB is the minimum input by the user.  For
this to work, the user should not have to see the plethora of XML config
files that heartbeat has (or at least, used to have).  I'm talking about
the user-visible part here, not the technical reality.  The O2CB
frontend or some other piece of software can take the user's name:ip
node mapping and turn it into whatever XML it needs, but the user
shouldn't have to do anything more than ocfs2console requires them
today.

Joel


-- 

"If you took all of the grains of sand in the world, and lined
 them up end to end in a row, you'd be working for the government!"
	- Mr. Interesting

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 17:18 ` Joel Becker
@ 2005-10-18 18:03   ` Lars Marowsky-Bree
  2005-10-18 18:27     ` Joel Becker
  2005-10-18 18:47     ` Mark Fasheh
  2005-10-18 18:20   ` Jeff Mahoney
  1 sibling, 2 replies; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-18 18:03 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-18T15:18:49, Joel Becker <Joel.Becker@oracle.com> wrote:

I'm too tired to go into the filesystem details and I have a better
understanding of the user-space parts than the fs layer ;-) I'll wade
through the rest tomorrow morning to see whether I can add something
useful to that part of the discussion too.

> 	Given that heartbeat regions can and should be shared, you need
> a way to describe this.  We don't have userspace doing global heartbeat
> yet, but there is no reason that all OCFS2 volumes can't share one
> heartbeat region (see
> http://oss.oracle.com/projects/ocfs2-tools/src/branches/global-heartbeat/documentation/o2cb/).

Good point, but I think part of Jeff's proposal is to pull-out the
heartbeating from OCFS2 into user-space, so OCFS2 no longer would
maintain its own heartbeat, and thus no heartbeat region.

Membership events (nodes up, down) would be provided to OCFS2
post-fencing.

> 	Have you also considered what this will or won't do to possible
> interaction with the CMan stack?  We'd love OCFS2 to handle both stacks.

This is hard for us to judge, but given that CMan in recent mailing list
discussions seems to be moving towards a user-space driven membership
too, it's fairly likely useable here too.

The main semantic difference I can make out between the RHAT DLM and the
one OCFS2 uses is how events are delivered across the
cluster; while OCFS2 doesn't seem to care much, RHAT's DLM requires a
"suspend all nodes - reconfigure / submit events - tell all nodes to
resume" three-phase protocol.
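
For illustration only, the three-phase sequence can be modelled in a few lines
of Python. This follows the description above, not the actual RHAT DLM
interface; `Node` and `three_phase_update` are made-up names.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []

    def suspend(self):
        self.log.append("suspend")

    def reconfigure(self, members):
        self.log.append(("reconfigure", tuple(sorted(members))))

    def resume(self):
        self.log.append("resume")


def three_phase_update(nodes, new_members):
    # Phase 1: every node stops before anyone sees the new membership.
    for n in nodes:
        n.suspend()
    # Phase 2: membership events are submitted to all nodes.
    for n in nodes:
        n.reconfigure(new_members)
    # Phase 3: only then is everyone told to resume.
    for n in nodes:
        n.resume()
```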

Our user-space stack is capable of driving both, as it happens. Funny,
actually - we have been working on the assumption that our user-space
stack needs to be able to drive all CFS implementations, and now you
bring up that you want your CFS to be driven by both stacks. ;-)

> 	Finally, have you considered the user barriers to this?  The
> absolute bottom-line goal of O2CB is the minimum input by the user.  For
> this to work, the user should not have to see the plethora of XML config
> files that heartbeat has (or at least, used to have).  I'm talking about
> the user-visible part here, not the technical reality.  The O2CB
> frontend or some other piece of software can take the user's name:ip
> node mapping and turn it into whatever XML it needs, but the user
> shouldn't have to do anything more than ocfs2console requires them
> today.

heartbeat used to have 3 straightforward config files; the XML based
configuration file (one of them, actually) is pretty new. "Plethora of
XML configuration files" certainly isn't true of heartbeat 2.x, and
never was. The XML configuration file is even automatically replicated
across cluster nodes so the user can't get them desync'ed ;-)

My goal is for the user to only add a single resource entry (a so-called
"clone" resource type) to the configuration for each OCFS2 filesystem,
and then he'd be done; the cluster would auto-generate everything else.

(As it happens, another group at Novell already has demoed this with
Novell Clustering Services; the OCFS2 configuration file is generated
from the LDAP-based cluster configuration automatically. Something
similar is what I'm aiming for here: tell us on which nodes to mount the
filesystem on (all or a subset), point us at a storage, tell us the
network to use, say "go".)

Admittedly, until heartbeat 2.x has a nice GUI, that will imply an XML
blurb to be fed to one of our tools. And yes, heartbeat 2.x's
configuration file is as simple as possible (if one ignores the XML
verbosity), but still, it is a quite powerful tool. 

But, as far as heartbeat 2.x based clusters are concerned, this will be
as easy as it can get. Trust me; I'm on the receiving end of support
cases, and I don't want to make it easier to misconfigure, so the
less configuration possible, the better. Support has my cell phone
number and is not afraid to use it, I'm afraid... Does that make my
motivation sincerely believable? ;-)

Of course, this brings up a valid point; currently, OCFS2 can run "stand
alone" w/o any supporting user-space stack. Uhm. As RAC doesn't
interoperate with _any_ other stack, I assume this is a property which
needs to be preserved.

I'm not sure our proposal covers that case adequately; I'm thinking we
were thinking "rip & replace!" when it comes to membership/fencing, not
"either - or", but I might be wrong. This however might not be too
difficult to extend, because we only modify the heartbeat/fencing stack
- instead of ripping it out, we need to make it switchable.

So, going back to your original question, in the stand-alone mode as it
is now, the node membership would simply be "global" - all filesystems
would inherit membership from the same 'cluster'. While in our case, we
might run each filesystem with its own membership - that could be as
simple as a pointer in the per-fs data structures.

Before I write too much more crap I better get some sleep now ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 17:18 ` Joel Becker
  2005-10-18 18:03   ` Lars Marowsky-Bree
@ 2005-10-18 18:20   ` Jeff Mahoney
  2005-10-19 14:57     ` Lars Marowsky-Bree
  1 sibling, 1 reply; 31+ messages in thread
From: Jeff Mahoney @ 2005-10-18 18:20 UTC (permalink / raw)
  To: ocfs2-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Joel Becker wrote:
> On Tue, Oct 18, 2005 at 05:56:27PM -0400, Jeff Mahoney wrote:
>> I'll start with an idea of what I'd like to see the configfs space look
>> like, since I think it will probably illustrate it best:
>>
>> /config/cluster/ocfs2/<fs uuid>/<node>/
> 
> 	If you are treating each mount as a 'cluster', the ocfs2 path
> element is pretty redundant, and /config/cluster/<fs uuid> would
> suffice.

I was leaving /config/cluster as a global subsystem, since there may be
other users of the namespace in the future. I wasn't using ocfs2 as an
identifier of a cluster name, but rather the name of the particular
cluster subsystem.

> 	Given that heartbeat regions can and should be shared, you need
> a way to describe this.  We don't have userspace doing global heartbeat
> yet, but there is no reason that all OCFS2 volumes can't share one
> heartbeat region (see
> http://oss.oracle.com/projects/ocfs2-tools/src/branches/global-heartbeat/documentation/o2cb/).

As Lars mentioned in his reply, part of my plan is to remove the
heartbeating aspect completely when the userspace clustering is enabled.
The node manager would be fed membership information from userspace,
where it knows what the network topology and storage configuration is.
Quorum and fencing decisions would be made in userspace using the
algorithms already implemented and tested.

Since there are any number of possible topologies, I'd like the
granularity for controlling membership to be at the per-file system level.

> 	Have you also considered what this will or won't do to possible
> interaction with the CMan stack?  We'd love OCFS2 to handle both stacks.

I'm not really familiar with the CMan stack, but I was hoping that the
configuration I described would be easy enough for any userspace cluster
manager to handle. Lars and Andrew Beekhof are working with me on the
cluster side of things, so they'd be more familiar with the details here.

> 	Finally, have you considered the user barriers to this?  The
> absolute bottom-line goal of O2CB is the minimum input by the user.  For
> this to work, the user should not have to see the plethora of XML config
> files that heartbeat has (or at least, used to have).  I'm talking about
> the user-visible part here, not the technical reality.  The O2CB
> frontend or some other piece of software can take the user's name:ip
> node mapping and turn it into whatever XML it needs, but the user
> shouldn't have to do anything more than ocfs2console requires them
> today.

I'm not sure if we're on the same page here. I'm not proposing a
complete replacement of o2cb. I'm proposing we keep o2cb and supplement
it with the ability to handle input from a userspace cluster manager,
regardless of what that cluster manager is. The implementation of that
cluster manager should be completely outside the scope of the file
system. I'm envisioning the cluster manager choice (userspace or o2cb)
as a compile time option, but I suppose it could be a module load time
option as well.

The differences to o2cb should be minimal, and absolutely hidden from
the user unless they desire more functionality, such as alternate
network paths.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDVYQhLPWxlyuTD7IRAijgAJ9GKB9qk1nliD39da4SJFKzBIYs+QCeJ3nX
tWMar1L44rcOBlw1yYx8g1Y=
=3Q1D
-----END PGP SIGNATURE-----


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 18:03   ` Lars Marowsky-Bree
@ 2005-10-18 18:27     ` Joel Becker
  2005-10-18 18:50       ` Mark Fasheh
  2005-10-19  8:26       ` Lars Marowsky-Bree
  2005-10-18 18:47     ` Mark Fasheh
  1 sibling, 2 replies; 31+ messages in thread
From: Joel Becker @ 2005-10-18 18:27 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Oct 19, 2005 at 01:03:23AM +0200, Lars Marowsky-Bree wrote:
> Good point, but I think part of Jeff's proposal is to pull-out the
> heartbeating from OCFS2 into user-space, so OCFS2 no longer would
> maintain its own heartbeat, and thus no heartbeat region.

	Duh, right.  Then the heartbeat part of the hierarchy isn't even
useful to OCFS2.  But you will need to come up with some method
(netlink, in-kernel api, whatever) for OCFS2 to register itself with
heartbeat for events.  I have to assume this API already exists, because
heartbeat consumers would need it.

> Membership events (nodes up, down) would be provided to OCFS2
> post-fencing.

	I believe (Mark, correct me if I'm wrong) that OCFS2 merely
requires the standard "DLM must find out first" protocol.  That is, the
DLM must be able to lock out all locking changes before the filesystem
tries to recover anything.  I believe GFS and even VMS CFS rely on this
property.

> heartbeat used to have 3 straightforward config files; the XML based
> configuration file (one of them, actually) is pretty new. "Plethora of
> XML configuration files" certainly isn't true of heartbeat 2.x, and
> never was. The XML configuration file is even automatically replicated
> across cluster nodes so the user can't get them desync'ed ;-)

	"Plethora of configuration files" is my way of saying "last time
I tried heartbeat, it wasn't _IMMEDIATELY_ obvious what I needed to edit
to get it going."  A lot of this can be handled with wrapper software.
For example, in OCFS2, ocfs2console will create cluster.conf for you and
populate it.  All you really, really need is your name:ip pairs.  So, in
the old heartbeat case, I (the sysadmin) really shouldn't have to be
editing the fencing config file if I am going to be using the default.
I shouldn't have to know about it (lord knows, I didn't back when I
tried, and it caused much consternation).
	One of the design docs on my plate for OCFS2 is the "Simple User
Experience" doc.  The idea is to design the "easy, bullet-proof init" of
a single mount.  Once written, any change that would add a step to the
document would be rejected unless deemed absolutely necessary by a
unanimous vote, or the like.  You see where I'm going?  The concern here
is not the complexity for a know-what-I'm-doing admin at a big company,
but the most basic install for the most basic system.
	Of course, if we're figuring on leaving O2CB for that person,
and having heartbeat2 as a 'more fancy' user, that's a whole 'nother
story.  Then it's your problem :-)

> Of course, this brings up a valid point; currently, OCFS2 can run "stand
> alone" w/o any supporting user-space stack. Uhm. As RAC doesn't
> interoperate with _any_ other stack, I assume this is a property which
> needs to be preserved.

	You bring up an interesting point.  It's not the lack of
userspace stack we care about.  It's the ease of the stack.  Assume that
Joe User only wants to run RAC on OCFS2, he could care less about O2CB,
heartbeat2, or CMan.  What we care about is that:

1) It is mind-numbingly transparent, easy, and obvious.
2) It is available on all of our supported platforms (provided by the
   platform or by us).

	Today, ocfs2console provides (1), O2CB provides (2).

Joel

-- 

Life's Little Instruction Book #20

	"Be forgiving of yourself and others."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 18:03   ` Lars Marowsky-Bree
  2005-10-18 18:27     ` Joel Becker
@ 2005-10-18 18:47     ` Mark Fasheh
  2005-10-19  8:35       ` Lars Marowsky-Bree
  1 sibling, 1 reply; 31+ messages in thread
From: Mark Fasheh @ 2005-10-18 18:47 UTC (permalink / raw)
  To: ocfs2-devel

Hi Lars,

On Wed, Oct 19, 2005 at 01:03:23AM +0200, Lars Marowsky-Bree wrote:
> > 	Have you also considered what this will or won't do to possible
> > interaction with the CMan stack?  We'd love OCFS2 to handle both stacks.
> 
> This is hard for us to judge, but given that CMan in recent mailing list
> discussions seems to be moving towards a user-space driven membership
> too, it's fairly likely useable here too.
Right, and we're only interested in the new (userspace only) CMan, so no
need to consider the kernel code.

> My goal is for the user to only add a single resource entry (a so-called
> "clone" resource type) to the configuration for each OCFS2 filesystem,
> and then he'd be done; the cluster would auto-generate everything else.
Do you mean that the user would have to add a configuration entry for every
single OCFS2 mount on their machine? Unless that's done automatically and
transparently (perhaps from mount.ocfs2), it's pretty much a non starter.
We've always wanted the user's path (once the cluster stack is installed and
configured) to be as easy as a local filesystem: mkfs.ocfs2 /dev/foo; mount
/dev/foo /ocfs2.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 18:27     ` Joel Becker
@ 2005-10-18 18:50       ` Mark Fasheh
  2005-10-19  8:26       ` Lars Marowsky-Bree
  1 sibling, 0 replies; 31+ messages in thread
From: Mark Fasheh @ 2005-10-18 18:50 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Oct 18, 2005 at 04:27:52PM -0700, Joel Becker wrote:
> > Membership events (nodes up, down) would be provided to OCFS2
> > post-fencing.
> 
> 	I believe (Mark, correct me if I'm wrong) that OCFS2 merely
> requires the standard "DLM must find out first" protocol.  That is, the
> DLM must be able to lock out all locking changes before the filesystem
> tries to recover anything.  I believe GFS and even VMS CFS rely on this
> property.
No, it's the other way around - the file system requires notification before
the DLM so that it can mark itself as requiring recovery *before* the DLM
has a chance to start giving away locks that were previously protecting
resources in use by the other node.
You're correct of course that GFS and other cluster file systems require the
same behavior.
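
A minimal Python sketch of that ordering, with the file system marked for
recovery before the lock manager is allowed to release the dead node's locks.
`FileSystem`, `LockManager`, and `deliver_node_down` are hypothetical stand-ins,
not OCFS2 or DLM code.

```python
class FileSystem:
    def __init__(self):
        self.needs_recovery = set()

    def node_down(self, node):
        # Mark recovery as pending before any lock can change hands.
        self.needs_recovery.add(node)


class LockManager:
    def __init__(self, fs, locks):
        self.fs = fs
        self.locks = locks  # lock name -> owning node

    def node_down(self, node):
        # Only safe once the file system already knows about the dead node.
        if node not in self.fs.needs_recovery:
            raise RuntimeError("DLM notified before the file system")
        freed = sorted(n for n, owner in self.locks.items() if owner == node)
        for name in freed:
            del self.locks[name]
        return freed


def deliver_node_down(node, fs, dlm):
    fs.node_down(node)          # file system first
    return dlm.node_down(node)  # then the DLM
```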
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 18:27     ` Joel Becker
  2005-10-18 18:50       ` Mark Fasheh
@ 2005-10-19  8:26       ` Lars Marowsky-Bree
  2005-10-19 12:49         ` Joel Becker
  2005-10-19 16:30         ` Jeff Mahoney
  1 sibling, 2 replies; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-19  8:26 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-18T16:27:52, Joel Becker <Joel.Becker@oracle.com> wrote:

> 	Duh, right.  Then the heartbeat part of the hierarchy isn't even
> useful to OCFS2.

Actually a good point. I don't think the heartbeat hierarchy is needed
if driven by a user-space membership.

> But you will need to come up with some method (netlink, in-kernel api,
> whatever) for OCFS2 to register itself with heartbeat for events.  I
> have to assume this API already exists, because heartbeat consumers
> would need it.

We're thinking from opposite directions, actually. 

OCFS2 doesn't register with us in this model; _we_ drive OCFS2 and
provide it with the events; we manage it, so we know it's there.

> > Membership events (nodes up, down) would be provided to OCFS2
> > post-fencing.
> 	I believe (Mark, correct me if I'm wrong) that OCFS2 merely
> requires the standard "DLM must find out first" protocol.  That is, the
> DLM must be able to lock out all locking changes before the filesystem
> tries to recover anything.  I believe GFS and even VMS CFS rely on this
> property.

Our Cluster Resource Manager models the dependencies between the various
components, ie DLM to CFS in this case, and supplies the events in the
correct order to them.

> 	Of course, if we're figuring on leaving O2CB for that person,
> and having heartbeat2 as a 'more fancy' user, that's a whole 'nother
> story.  Then it's your problem :-)

That's probably the best way to approach this right now ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 18:47     ` Mark Fasheh
@ 2005-10-19  8:35       ` Lars Marowsky-Bree
  0 siblings, 0 replies; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-19  8:35 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-18T16:47:57, Mark Fasheh <mark.fasheh@oracle.com> wrote:

> Do you mean that the user would have to add a configuration entry for every
> single OCFS2 mount on their machine? Unless that's done automatically and
> transparently (perhaps from mount.ocfs2), it's pretty much a non starter.

> We've always wanted the user's path (once the cluster stack is
> installed and configured) to be as easy as a local filesystem:
> mkfs.ocfs2 /dev/foo; mount /dev/foo /ocfs2.

Well, I think we're approximately on the same page here.

Your "mkfs ; mount" approach works because you are looking at a special
case; you just have the filesystem, nothing else it depends on, nothing
else depends on it.

I think you expect the mount command as above to register the filesystem
with the cluster manager automatically, while our approach is sort-of
the other way around.

The admin has to add the configuration entry to the CRM because in our
model, the filesystem mount is a cluster resource; if he wants us to be
aware of it (which means that we know to mount it before trying to start
Oracle on that node, for example, to give it membership events etc -
for that we need to not only know about the filesystem itself, but also
about how it depends on other components (SAN reservations, fencing...)
and how other components depend on it), he needs to tell us.
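
As a sketch of why the dependencies have to be declared, a start order can be
derived from them with a topological sort. The resource names below are made
up, and this uses Python's standard `graphlib` module (3.9+), not the CRM's own
configuration or ordering logic.

```python
from graphlib import TopologicalSorter

# resource -> set of resources it depends on (all names are illustrative)
depends_on = {
    "san-reservation": {"fencing"},
    "ocfs2-mount": {"san-reservation"},
    "oracle": {"ocfs2-mount"},
}

# Dependencies come out before the resources that need them.
start_order = list(TopologicalSorter(depends_on).static_order())
print(start_order)
```

Without the "oracle depends on ocfs2-mount" entry, nothing would tell the
manager to mount the file system before starting the database.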

Yes, the configuration is a bit of a hassle. The admin in this case has
to model the relationship of the clustered resources to one another. We
need to make this easier as far as that is possible.

And yes, auto-discovery of these relations would rock; if we could query
what resources are active on the node, what they depend on, ... we could
create a cluster configuration automatically. CIM/WBEM might actually go
into that direction. But right now, I consider this a research problem
still.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-19  8:26       ` Lars Marowsky-Bree
@ 2005-10-19 12:49         ` Joel Becker
  2005-10-19 17:41           ` Jeff Mahoney
  2005-10-19 16:30         ` Jeff Mahoney
  1 sibling, 1 reply; 31+ messages in thread
From: Joel Becker @ 2005-10-19 12:49 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Oct 19, 2005 at 03:26:24PM +0200, Lars Marowsky-Bree wrote:
> Actually a good point. I don't think the heartbeat hierarchy is needed
> if driven by a user-space membership.

	Well, if it is information that some kernel component would
want/need, then sure it would live in configfs somewhere, but not under
OCFS2.
	Is your heartbeat loop (the actual does-the-beating code) in
kernel or userspace?  I thought that was your only kernel component at
one point, because of scheduling issues with userspace, but please
correct me.

> OCFS2 doesn't register with us in this model; _we_ drive OCFS2 and
> provide it with the events; we manage it, so we know it's there.

	Well, there needs to be some entry point by which OCFS2 receives
events.  We don't care how it is done, I guess, but it needs the usual
async, locking-safe, yadayada that everyone expects.

> > 	Of course, if we're figuring on leaving O2CB for that person,
> > and having heartbeat2 as a 'more fancy' user, that's a whole 'nother
> > story.  Then it's your problem :-)
> 
> That's probably the best way to approach this right now ;-)

	Put succinctly: We absolutely require the minimal mkfs; mount;
paradigm be available for our users.  We will not settle for less.
	How that is done we don't much care.  So if your system can't
provide it, O2CB will continue to do so.  We'll be happy to help
integrate with your stuff as well, as long as it doesn't compromise
O2CB.  Then, if someone is already using your manager, they can just use
it with OCFS2.  But anyone not already using your manager can just
mkfs;mount; with O2CB.

Joel

-- 

 "I'm living so far beyond my income that we may almost be said
 to be living apart."
         - e e cummings

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-18 18:20   ` Jeff Mahoney
@ 2005-10-19 14:57     ` Lars Marowsky-Bree
  2005-10-19 17:42       ` David Teigland
  0 siblings, 1 reply; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-19 14:57 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-18T19:24:18, Jeff Mahoney <jeffm@suse.com> wrote:

> > 	Have you also considered what this will or won't do to possible
> > interaction with the CMan stack?  We'd love OCFS2 to handle both stacks.
> I'm not really familiar with the CMan stack, but I was hoping that the
> configuration I described would be easy enough for any userspace cluster
> manager to handle. Lars and Andrew Beekhof are working with me on the
> cluster side of things, so they'd be more familiar with the details here.

David Teigland is subscribed to this list according to mailman, so,
David, what are your thoughts? ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-19  8:26       ` Lars Marowsky-Bree
  2005-10-19 12:49         ` Joel Becker
@ 2005-10-19 16:30         ` Jeff Mahoney
  2005-10-20  5:24           ` Lars Marowsky-Bree
  2005-10-20  6:04           ` Andrew Beekhof
  1 sibling, 2 replies; 31+ messages in thread
From: Jeff Mahoney @ 2005-10-19 16:30 UTC (permalink / raw)
  To: ocfs2-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Lars Marowsky-Bree wrote:
> Actually a good point. I don't think the heartbeat hierarchy is needed
> if driven by a user-space membership.

If we're to provide membership information on a per file system basis,
we'll need some way to distinguish between them. The hierarchy may not
matter in the case of the o2cb global heartbeat, but it does for the
userspace heartbeat.

> OCFS2 doesn't register with us in this model; _we_ drive OCFS2 and
> provide it with the events; we manage it, so we know it's there.

No, OCFS2 needs to register with userspace.

The userspace heartbeat should only care about nodes where the file
system is actually mounted. Otherwise, a random node that has the
ability to mount a file system but doesn't actually have it mounted
could cause heartbeat events across the cluster. That shouldn't happen.

In order to do this, I think that at mount time, we should call out to
user space to tell it to start caring about this node for a particular
heart beat group. When the file system is umounted, we call out again
and tell it to stop caring.
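
A minimal sketch of the per-file-system membership idea, assuming a
hypothetical in-memory tracker (the class and method names below are
illustrative only and are not part of OCFS2 or heartbeat2). It shows that a
node which never mounted a file system produces no events for that group:

```python
from collections import defaultdict


class HeartbeatGroups:
    """Hypothetical tracker: which nodes have which file system mounted."""

    def __init__(self):
        # file system UUID -> set of node names that currently have it mounted
        self._members = defaultdict(set)

    def mount(self, fs_uuid, node):
        # Mount-time callout: start caring about this node for this group.
        self._members[fs_uuid].add(node)

    def umount(self, fs_uuid, node):
        # Umount-time callout: stop caring about this node for this group.
        self._members[fs_uuid].discard(node)

    def node_down(self, node):
        """Return {fs_uuid: surviving members} for groups the node was in."""
        affected = {}
        for fs_uuid, nodes in self._members.items():
            if node in nodes:
                nodes.remove(node)
                affected[fs_uuid] = sorted(nodes)
        return affected


groups = HeartbeatGroups()
groups.mount("uuid-a", "node1")
groups.mount("uuid-a", "node2")
groups.mount("uuid-b", "node2")

# node3 could mount either file system but has not, so nobody is notified.
events_for_idle_node = groups.node_down("node3")
```

Only groups that actually contained the failed node appear in the result, so
an idle node going down leaves every mounted file system undisturbed.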

Only using the cluster manager to mount or umount a file system isn't an
acceptable use pattern. OCFS2 shouldn't become so special cased that
it's a pain to work with. Ideally, it should only be slightly more
difficult to configure than o2cb is now. mount -t ocfs2 should work with
no additional effort for the common case. There should be a default
OCFS2 configuration that we can use for common mounts, and then special
cased configurations for more advanced topologies. We can pass out the
UUID as a parameter; I don't think this should be too difficult to do.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDVrvoLPWxlyuTD7IRAibPAKCMUrfsy4WMUBDpyZ0BKqNy9KcNjwCggxE1
bZbDDewALUQBLnswO+8Mnio=
=uDuP
-----END PGP SIGNATURE-----


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-19 12:49         ` Joel Becker
@ 2005-10-19 17:41           ` Jeff Mahoney
  2005-10-20  7:39             ` Lars Marowsky-Bree
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff Mahoney @ 2005-10-19 17:41 UTC (permalink / raw)
  To: ocfs2-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Joel Becker wrote:
> On Wed, Oct 19, 2005 at 03:26:24PM +0200, Lars Marowsky-Bree wrote:
>> Actually a good point. I don't think the heartbeat hierarchy is needed
>> if driven by a user-space membership.
> 
> 	Well, if it is information that some kernel component would
> want/need, then sure it would live in configfs somewhere, but not under
> OCFS2.

Yes and no. I'd like to revise my original proposal here. I think that
my proposed node hierarchy changes would be required to provide the
alternate network paths and such, but would the "active" attribute
really be required? Initially, I wanted to make the distinction between
a node's membership being removed and the node simply going down. On
further thought, I don't think this distinction actually needs to be made.

OCFS2 is aware if a node down event is expected or not by the umount
map. Therefore, it seems simple enough to allow a local node's
membership to imply a heartbeat presence.

So, how about the following:

The existing heartbeat directory structure can stay as it is. It will
only be available when o2cb is active.

/configfs/<cluster>/<uuid>/<node>/
                                  ip address
                                  port
                                  local
                                  fs slot
                                  node number

The user space heartbeat will create and remove the <node> directory on
up/down events. OCFS2 will take appropriate action as expected with the
current heartbeat implementation. I intend to simply queue the events as
they are now and use the existing callback infrastructure to distribute
them.
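
As a rough sketch of how a user space heartbeat might drive this layout: the
attribute file names below are assumptions (the proposal only names them
informally), and a temporary directory stands in for the configfs mount
point. On a real configfs the kernel would populate the attribute files on
mkdir, so the explicit file creation and unlinking here exist only to make the
sketch runnable on an ordinary file system.

```python
import tempfile
from pathlib import Path

# Hypothetical attribute file names derived from the list in the proposal.
ATTRS = ("ip_address", "port", "local", "fs_slot", "node_number")


def node_up(root, cluster, fs_uuid, node, **attrs):
    """Create <root>/<cluster>/<uuid>/<node>/ and fill in its attributes."""
    node_dir = Path(root) / cluster / fs_uuid / node
    node_dir.mkdir(parents=True)
    for name in ATTRS:
        (node_dir / name).write_text(f"{attrs[name]}\n")
    return node_dir


def node_down(root, cluster, fs_uuid, node):
    """Remove the node directory; its disappearance is the down event."""
    node_dir = Path(root) / cluster / fs_uuid / node
    for child in node_dir.iterdir():
        child.unlink()
    node_dir.rmdir()


root = tempfile.mkdtemp()  # stand-in for the configfs mount point
node_dir = node_up(root, "mycluster", "UUID", "node1",
                   ip_address="192.168.0.1", port=7777,
                   local=1, fs_slot=0, node_number=1)
```

Calling `node_down(root, "mycluster", "UUID", "node1")` afterwards removes the
directory again, mirroring a node-down event.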

> 	Put succinctly: We absolutely require the minimal mkfs; mount;
> paradigm be available for our users.  We will not settle for less.
> 	How that is done we don't much care.  So if your system can't
> provide it, O2CB will continue to do so.  We'll be happy to help
> integrate with your stuff as well, as long as it doesn't compromise
> O2CB.  Then, if someone is already using your manager, they can just use
> it with OCFS2.  But anyone not already using your manager can just
> mkfs;mount; with O2CB.

I totally agree that mkfs;mount should work. It's what users expect to
work for a file system, and we don't want OCFS2 to be special cased so
much that nobody wants to deal with it. Users with more advanced
topologies can handle the additional configuration load.

- -Jeff

- --
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDVsxpLPWxlyuTD7IRAkauAKCS+C9vzh2T8t/vI8ww682ATpzYjQCfX329
I14cD4TccpuEyek4gELeu3I=
=d6GW
-----END PGP SIGNATURE-----


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-19 14:57     ` Lars Marowsky-Bree
@ 2005-10-19 17:42       ` David Teigland
  2005-10-20  5:58         ` Lars Marowsky-Bree
  0 siblings, 1 reply; 31+ messages in thread
From: David Teigland @ 2005-10-19 17:42 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Oct 19, 2005 at 09:56:54PM +0200, Lars Marowsky-Bree wrote:
> On 2005-10-18T19:24:18, Jeff Mahoney <jeffm@suse.com> wrote:
> 
> > > 	Have you also considered what this will or won't do to possible
> > > interaction with the CMan stack?  We'd love OCFS2 to handle both stacks.
> > I'm not really familiar with the CMan stack, but I was hoping that the
> > configuration I described would be easy enough for any userspace cluster
> > manager to handle. Lars and Andrew Beekhof are working with me on the
> > cluster side of things, so they'd be more familiar with the details here.
> 
> David, what are your thoughts? ;-)

Just catching up on this after being away for a while.  Not only has cman
moved entirely to user space, but a large portion of gfs (everything
related to cman and clustering) has also moved to user space.  So, a user
space gfs daemon (call it gfs_clusterd) interacts with the other user
space clustering systems and drives the bits of gfs in the kernel. 

Here are the main "knobs" gfs_clusterd uses to control a specific fs:

/sys/fs/gfs2/<fs_name>/lock_module/
                                   block
                                   mounted
                                   jid
                                   recover

When a gfs fs is mounted on a node:

. the mount process enters gfs-kernel
. the mount process sends a simple uevent to gfs_clusterd
. the mount process waits for gfs_clusterd to write 1 to /sys/.../mounted

. gfs_clusterd gets the mount uevent from gfs-kernel
. gfs_clusterd joins the cluster-wide "group" that represents the
  specific fs being mounted [1]
. gfs_clusterd tells gfs-kernel which journal the local node will use by
  writing the journal id to /sys/.../jid
. gfs_clusterd tells the mount process it can continue by writing 1
  to /sys/.../mounted
. the local node now has the fs mounted

[1] As part of the node being added to the group, gfs_clusterd on the
nodes that already have the fs mounted is notified of the new mounter for
the fs.
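
The daemon side of the mount steps above can be sketched roughly as follows.
This is only an illustration: `handle_mount_uevent`, `join_group` and
`pick_journal` are hypothetical names, and a temporary directory stands in
for /sys/fs/gfs2/<fs_name>/lock_module/.

```python
import tempfile
from pathlib import Path


def handle_mount_uevent(lock_module_dir, fs_name, join_group, pick_journal):
    """Hypothetical daemon-side handling of a mount uevent."""
    d = Path(lock_module_dir)
    join_group(fs_name)                 # join the cluster-wide group for this fs
    jid = pick_journal(fs_name)         # decide which journal the local node uses
    (d / "jid").write_text(f"{jid}\n")  # tell the kernel side the journal id
    (d / "mounted").write_text("1\n")   # let the waiting mount process continue
    return jid


sysfs = Path(tempfile.mkdtemp())  # stand-in for the lock_module directory
joined = []
jid = handle_mount_uevent(sysfs, "gfs0", joined.append, lambda fs: 2)
```

The ordering matters: the journal id is written before `mounted`, because the
mount process resumes as soon as it sees the 1.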

When a node that has a gfs file system mounted fails:

. the cluster infrastructure notifies gfs_clusterd that a node failed
. gfs_clusterd writes 1 to /sys/.../block to block new lock requests from gfs
. the infrastructure notifies gfs_clusterd that gfs_clusterd is "stopped"
  (and therefore blocked) on all mounters
. gfs_clusterd tells gfs-kernel to recover the journal of the failed
  node by writing the journal id of the failed node to /sys/.../recover
. when journal recovery is done, gfs-kernel sends a uevent to gfs_clusterd
. gfs_clusterd tells gfs-kernel to continue normal operation by
  writing 0 to /sys/.../block
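
The recovery sequence above, reduced to a sketch. The two wait callbacks are
hypothetical placeholders for "all mounters are stopped" and "the kernel sent
the recovery-done uevent", and a temporary directory again stands in for the
sysfs directory.

```python
import tempfile
from pathlib import Path


def recover_failed_node(lock_module_dir, failed_jid,
                        wait_all_stopped, wait_recovery_done):
    """Hypothetical daemon-side recovery after a mounter fails."""
    d = Path(lock_module_dir)
    log = []

    def write(name, value):
        (d / name).write_text(f"{value}\n")
        log.append((name, value))

    write("block", 1)             # block new lock requests
    wait_all_stopped()            # every mounter is stopped and blocked
    write("recover", failed_jid)  # ask the kernel to replay the failed journal
    wait_recovery_done()          # kernel signals that journal recovery is done
    write("block", 0)             # resume normal operation
    return log


sysfs = Path(tempfile.mkdtemp())  # stand-in for the lock_module directory
log = recover_failed_node(sysfs, 3, lambda: None, lambda: None)
```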

That's a simplified example of how we control gfs from user space.  Our
dlm is controlled in a similar way by the dlm_controld daemon.  Think of
the user daemon (gfs_clusterd) and kernel module (gfs.ko) as two parts of
a single system and sysfs/configfs as more of an internal communication
path between the two parts, not so much an external API.

It's the interfaces the two user daemons have with the cluster
infrastructure (membership/group manager, crm, etc) that would need to be
studied to use gfs in other environments.  None of this is easy, but
there's far more flexibility working it out in user space than in the
kernel.  The same may be the case for ocfs.

Dave


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-19 16:30         ` Jeff Mahoney
@ 2005-10-20  5:24           ` Lars Marowsky-Bree
  2005-10-20 10:03             ` Joel Becker
  2005-10-20  6:04           ` Andrew Beekhof
  1 sibling, 1 reply; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-20  5:24 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-19T17:34:32, Jeff Mahoney <jeffm@suse.com> wrote:

> > Actually a good point. I don't think the heartbeat hierarchy is needed
> > if driven by a user-space membership.
> If we're to provide membership information on a per file system basis,
> we'll need some way to distinguish between them. The hierarchy may not
> matter in the case of the o2cb global heartbeat, but it does for the
> userspace heartbeat.

User-space knows which membership it needs to supply to which
filesystem UUID anyway, though. The heartbeat/ subdirectory (if that's
what we are talking about) only matters for in-kernel membership, as it
does now.

> > OCFS2 doesn't register with us in this model; _we_ drive OCFS2 and
> > provide it with the events; we manage it, so we know it's there.
> No, OCFS2 needs to register with userspace.
> 
> The userspace heartbeat should only care about nodes where the file
> system is actually mounted. Otherwise, a random node that has the
> ability to mount a file system but doesn't actually have it mounted
> could cause heartbeat events across the cluster. That shouldn't happen.

You're thinking like a filesystem-and-nothing-else guy, can't blame you
for that ;-)

> In order to do this, I think that at mount time, we should call out to
> user space to tell it to start caring about this node for a particular
> heart beat group. When the file system is umounted, we call out again
> and tell it to stop caring.
> 
> Only using the cluster manager to mount or umount a file system isn't an
> acceptable use pattern. OCFS2 shouldn't become so special cased that
> it's a pain to work with.

This is the only way of managing cluster resources. Cluster resources
must be solely controlled via the CRM. Cluster users are already quite
used to that; it's a basic property of all existing cluster stacks. For
example, the user is NOT allowed to mount a non-shared filesystem just
because he sees the SAN; he needs to use the CRM, or various
constraints can no longer be guaranteed.

A common model for how resources can register with a CRM doesn't exist
yet. Looking at the future: CIM/WBEM might one day offer something like
this, but we aren't there yet. Random subsystems registering with us w/o
them telling us their dependencies just doesn't work.

"mount the filesystem as a normal filesystem" is a use pattern which
works if your filesystem is the only thing which the cluster manages.
But what does it depend on? Is the node allowed to mount the filesystem
at all, based on the current active policy rules? Does it require
fencing? ...

And most especially, I don't want this event to come from the kernel to
user-space, I think.

What might just be about possible is that "mount" is patched to know
that it has to go through some special steps to "mount" an OCFS2 fs;
namely, not do it itself directly, but tell the CRM "Hey, user wants
this mounted, see what you can do".

As you're not allowed to mount the filesystem if you're not a proper
cluster member, the requirement for the cluster stack to be running
isn't anything new.

> There should be a default OCFS2 configuration that we can use for
> common mounts, and then special cased configurations for more advanced
> topologies. We can pass out the UUID as a parameter; I don't think
> this should be too difficult to do.

The "default" case is that the filesystem is mounted on all nodes (as
part of the cluster startup) all the time, and all nodes are equal.
Still, this requires the cluster to be told. And it needs to be told
that the filesystem needs to be started before the application etc.

See, the user has to tell us about the applications and other services
already anyway; as part of that configuration, he also tells us about
the filesystem. That's all consistent. I don't want to special case
OCFS2; the mount extension mentioned above might be a path to joining
both approaches.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-19 17:42       ` David Teigland
@ 2005-10-20  5:58         ` Lars Marowsky-Bree
  2005-10-20  9:45           ` David Teigland
  0 siblings, 1 reply; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-20  5:58 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-19T17:42:21, David Teigland <teigland@redhat.com> wrote:

> Just catching up on this after being away for a while.  Not only has cman
> moved entirely to user space, but a large portion of gfs (everything
> related to cman and clustering) has also moved to user space.  So, a user
> space gfs daemon (call it gfs_clusterd) interacts with the other user
> space clustering systems and drives the bits of gfs in the kernel. 

Morning David, thanks for your insights!

> Here are the main "knobs" gfs_clusterd uses to control a specific fs:
> 
> /sys/fs/gfs2/<fs_name>/lock_module/
>                                    block
>                                    mounted
>                                    jid
>                                    recover
> 
> When a gfs fs is mounted on a node:
> 
> . the mount process enters gfs-kernel
> . the mount process sends a simple uevent to gfs_clusterd
> . the mount process waits for gfs_clusterd to write 1 to /sys/.../mounted
> 
> . gfs_clusterd gets the mount uevent from gfs-kernel
> . gfs_clusterd joins the cluster-wide "group" that represents the
>   specific fs being mounted [1]
> . gfs_clusterd tells gfs-kernel which journal the local node will use by
>   writing the journal id to /sys/.../jid
> . gfs_clusterd tells the mount process it can continue by writing 1
>   to /sys/.../mounted
> . the local node now has the fs mounted

The /sys/.../mounted flag seems to be exactly the thing I don't like.
Sigh. ;-) It seems, however, that there's actual demand for this
functionality.

OK. I'll now make a 180 degree turn and say that we need to do this and
agree to figure out how ;-)

Ignoring the specific steps gfs_clusterd performs (which would be
different on our stack, of course), the main thing I don't like about
this is the hoop through kernel space for the uevent and the
notification.

(Also, your outline doesn't contain the possibility that the cluster
says "No, you CAN'T mount this. Rejected!" - is this for ease of
describing the case, or how is that implemented? Writing "2" to the
.../mounted flag or something?)

I'd much rather have all of this done in user-space prior to the actual
mount syscall being issued.

"mount" would need a generic hook by which it could call into the
cluster stuff (whatever it is) to a) have it authorize the mount, b)
_know_ about the mount, c) prepare the mount if needed - by bringing
online all pre-requisites on that node et cetera. 

Actually this is quite powerful. This hook could also be used for _non
cluster filesystems_ - the cluster could deny mounting of filesystems on
shared storage which are active on another node.

Same for umount. A nice side-effect for the umount would be that it
could actually ask the cluster "hey, admin wants this unmounted, stop
everything which depends on it on that node too! Migrate!".
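
A sketch of what such a hook might look like, purely as an illustration. The
cluster interface (`authorize`, `prepare`, `record_mount`) and the
`ExclusiveCluster` policy are hypothetical; `do_mount` stands in for the
actual mount syscall.

```python
class MountDenied(Exception):
    """Raised when the cluster refuses the mount."""


class ExclusiveCluster:
    """Hypothetical policy: a device may be active on only one node."""

    def __init__(self):
        self.active = {}  # device -> node that currently has it mounted

    def authorize(self, device, node):
        return self.active.get(device, node) == node

    def prepare(self, device, node):
        pass  # bring prerequisites online on that node; nothing to do here

    def record_mount(self, device, node):
        self.active[device] = node


def cluster_mount(cluster, device, mountpoint, node, do_mount):
    if not cluster.authorize(device, node):  # a) have the cluster authorize it
        raise MountDenied(f"{device} is active on another node")
    cluster.prepare(device, node)            # c) prepare the mount if needed
    do_mount(device, mountpoint)             # the real mount would happen here
    cluster.record_mount(device, node)       # b) the cluster now knows about it


cluster = ExclusiveCluster()
mounts = []
cluster_mount(cluster, "/dev/sdb1", "/mnt", "node1",
              lambda dev, mnt: mounts.append((dev, mnt)))
```

A second `cluster_mount` of the same device from "node2" would raise
`MountDenied`, which is the non-cluster-filesystem protection described
above.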

Two issues:

- This is a special case for filesystems. It'd be nice if we had a
  generic mechanism by which this also worked for all kinds of
  resources; as I've said, CIM seems to be going in that direction.
  This could then also be unified with the mechanism that, for example,
  the clustered LVMs use; the C-LVM2 already has such a mechanism
  internally too.
  
  However, filesystems are a fairly important case, and when we have
  more than one implementation of this mechanism (LVM + filesystem)
  we'll have a better idea of what such a generic mechanism would look
  like.

- Trapping this in user-space of course isn't as powerful as
  intercepting each and every mount syscall(); somebody calling directly
  would get a reject. This however seems acceptable to me?


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-19 16:30         ` Jeff Mahoney
  2005-10-20  5:24           ` Lars Marowsky-Bree
@ 2005-10-20  6:04           ` Andrew Beekhof
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Beekhof @ 2005-10-20  6:04 UTC (permalink / raw)
  To: ocfs2-devel

I'm kinda new here, so I apologize in advance if I have insulted  
anyone's intelligence below.
My specialty is in hb2 (specifically the CRM) and not yet in OCFS2,  
so I'm also happy to be corrected if I've missed the point or said  
something dumb there too.

Moving on...

On Oct 19, 2005, at 11:34 PM, Jeff Mahoney wrote:


> Lars Marowsky-Bree wrote:
>
>
>> Actually a good point. I don't think the heartbeat hierarchy is  
>> needed
>> if driven by a user-space membership.
>>
>>
>
> If we're to provide membership information on a per file system basis,
> we'll need some way to distinguish between them. The hierarchy may not
> matter in the case of the o2cb global heartbeat, but it does for the
> userspace heartbeat.
>
>
>
>> OCFS2 doesn't register with us in this model; _we_ drive OCFS2 and
>> provide it with the events; we manage it, so we know it's there.
>>
>>
>
> No, OCFS2 needs to register with userspace.
>
> The userspace heartbeat should only care about nodes where the file
> system is actually mounted. Otherwise, a random node that has the
> ability to mount a file system but doesn't actually have it mounted
> could cause heartbeat events across the cluster. That shouldn't
> happen.
>

I believe the idea here is that "not mounted" == "resource not running"

So like you said, if a node that could but doesn't have the  
filesystem mounted fails... then the filesystem will not hear  
anything about it.

There is also the related point that if it is mounted - we must know  
about it.
Having resources we're supposed to be managing active without us  
knowing is highly evil because it might violate the currently active  
cluster policy.


> In order to do this, I think that at mount time, we should call out to
> user space to tell it to start caring about this node for a particular
> heart beat group. When the file system is umounted, we call out again
> and tell it to stop caring.
>

As I mentioned to Jeff last night, we _could_ make something like  
this work.

However, thinking more about it, I don't think we should.

It seems to me that there are two use cases here and I believe that  
trying to bash one into the form of the other is a mistake.


The first is where the filesystem (often via the user) is in
control.  That's where you want mount to work transparently - updating
the cluster behind the scenes.


The second is where the cluster is in control.  Here a transparent
mount command is a hindrance because you end up calling back to the
cluster for no reason or benefit - in fact you end up creating a
loop.  Not that breaking it isn't possible - but logically it makes no
sense to create it in the first place.


In case people are wondering why you would even want the cluster to
be in control... it's because it knows more than the filesystem does.

For example the cluster knows that Apache on nodeX failed so we need
to migrate it (and the filesystem it requires) to nodeY.
Or it might be 7am which means the nightly auditing run is complete
and we don't need as many nodes to share the load.

In both cases the node is healthy, the filesystem is healthy - but we  
want to stop/move the filesystem anyway.

Alternatively the filesystem may have failed but there are some  
unrelated resources that can/must be safely migrated before the node  
is fenced.


My personal view (others may disagree) is that hb2 resource  
management (the CRM) should stay out of the first scenario.  Use the  
messaging, the membership, or the fencing pieces by all means... but  
not the CRM.
For that situation it doesn't really add anything over what already
exists and in fact it makes things worse.
It's worse because you now have two brains - both of which want to
enforce their will on the cluster, both with different policies and
perspectives.
I don't see how that can ever turn out well.


> Only using the cluster manager to mount or umount a file system  
> isn't an
> acceptable use pattern.
>

I don't think that it's an either/or here.  We need to be able to
support the second scenario without impacting the first... and then
let the user decide what fits their needs best.


> OCFS2 shouldn't become so special cased that
> it's a pain to work with. Ideally, it should only be slightly more
> difficult to configure than o2cb is now. mount -t ocfs2 should work  
> with
> no additional effort for the common case. There should be a default
> OCFS2 configuration that we can use for common mounts, and then  
> special
> cased configurations for more advanced topologies. We can pass out the
> UUID as a parameter; I don't think this should be too difficult to do.
>
> - -Jeff
>

--
Andrew Beekhof

"Would the last person to leave please turn out the enlightenment?" -  
TISM


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-19 17:41           ` Jeff Mahoney
@ 2005-10-20  7:39             ` Lars Marowsky-Bree
  0 siblings, 0 replies; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-20  7:39 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-19T18:44:57, Jeff Mahoney <jeffm@suse.com> wrote:

> So, how about the following:
> 
> The existing heartbeat directory structure can stay as it is. It will
> only be available when o2cb is active.
> 
> /configfs/<cluster>/<uuid>/<node>/
>                                   ip address
>                                   port
>                                   local
>                                   fs slot
>                                   node number
> 
> The user space heartbeat will create and remove the <node> directory on
> up/down events. OCFS2 will take appropriate action as expected with the
> current heartbeat implementation. I intend to simply queue the events as
> they are now and use the existing callback infrastructure to distribute
> them.

That's a good direction to move in, and right for OCFS2/DLM.

Two things, though:

1. DLM should be decoupled from the filesystem. The DLM should be
useable without OCFS2.

2. I'd encapsulate the network details somehow because in the future
this might support more than just plain TCP, maybe SCTP or several
links. How about:

	/configfs/<cluster>/<uuid>/<node>/link{0,1,...}/{ip,port,proto}

?
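
To make the brace notation concrete, here is a small sketch that expands it
into individual attribute paths. The function name and the default
"/configfs" root are assumptions for illustration only.

```python
from itertools import product


def link_attribute_paths(cluster, fs_uuid, node, n_links, root="/configfs"):
    """Expand <node>/link{0,1,...}/{ip,port,proto} into explicit paths."""
    attrs = ("ip", "port", "proto")
    return [f"{root}/{cluster}/{fs_uuid}/{node}/link{i}/{attr}"
            for i, attr in product(range(n_links), attrs)]


paths = link_attribute_paths("mycluster", "UUID", "node1", n_links=2)
```

Each link gets its own directory, so adding a second transport or address for
a node means adding a link directory rather than changing the node's own
attributes.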

> I totally agree that mkfs;mount should work. It's what users expect to
> work for a file system, and we don't want OCFS2 to be special cased so
> much that nobody wants to deal with it. Users with more advanced
> topologies can handle the additional configuration load.

Well, conceded. The question seems to now focus on how to hook this up
with the cluster stack. I still like the idea of modifying mount(8)
better than changing mount(2), though I might be convinced.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-20  5:58         ` Lars Marowsky-Bree
@ 2005-10-20  9:45           ` David Teigland
  0 siblings, 0 replies; 31+ messages in thread
From: David Teigland @ 2005-10-20  9:45 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Oct 20, 2005 at 12:57:55PM +0200, Lars Marowsky-Bree wrote:
> (Also, your outline doesn't contain the possibility that the cluster
> says "No, you CAN'T mount this. Rejected!" - is this for ease of
> describing the case, or how is that implemented? Writing "2" to the
> .../mounted flag or something?)

-1 I think

> I'd much rather have all of this done in user-space prior to the actual
> mount syscall being issued.
> 
> "mount" would need a generic hook by which it could call into the
> cluster stuff (whatever it is) to a) have it authorize the mount, b)
> _know_ about the mount, c) prepare the mount if needed - by bringing
> online all pre-requisites on that node et cetera. 

I agree, that would be very nice.  I'll find some time to dig up the
mount code.

Dave


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-20  5:24           ` Lars Marowsky-Bree
@ 2005-10-20 10:03             ` Joel Becker
  2005-10-20 10:25               ` David Teigland
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Becker @ 2005-10-20 10:03 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Oct 20, 2005 at 12:23:58PM +0200, Lars Marowsky-Bree wrote:
> On 2005-10-19T17:34:32, Jeff Mahoney <jeffm@suse.com> wrote:
> > In order to do this, I think that at mount time, we should call out to
> > user space to tell it to start caring about this node for a particular
> > heart beat group. When the file system is umounted, we call out again
> > and tell it to stop caring.
> 
> What might just be about possible is that "mount" is patched to know
> that it has to go through some special steps to "mount" an OCFS2 fs;
> namely, not do it itself directly, but tell the CRM "Hey, user wants
> this mounted, see what you can do".

	I'm not sure if you guys realize this, but mount is already
heavily involved in the cluster.  Mount.ocfs2 is a separate program, and
in the O2CB world it handles starting the heartbeat for a particular
device in the "local heartbeat" mode (by "local heartbeat", we mean our
default mode of "must be heartbeating on the device that is the mounted
filesystem").  So, there is never a kernel callout to ask the cluster
manager to care.  We don't like kernel callouts :-)  It's all done in
user space from mount.ocfs2.  Having mount.ocfs2 know how to talk to CRM
would be entirely analogous.
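
Roughly, the ordering a mount helper follows can be sketched like this. This
is not the actual mount.ocfs2 code; `start_heartbeat`, `stop_heartbeat` and
`do_mount` are hypothetical callbacks standing in for the real steps, and the
rollback on failure is an assumption of the sketch.

```python
def mount_helper(device, mountpoint, start_heartbeat, stop_heartbeat, do_mount):
    """Hypothetical helper: cluster setup happens in user space, before mount."""
    start_heartbeat(device)           # begin heartbeating on the device first
    try:
        do_mount(device, mountpoint)  # only now issue the actual mount
    except OSError:
        stop_heartbeat(device)        # roll back if the mount itself fails
        raise


log = []
mount_helper("/dev/sdb1", "/mnt",
             lambda dev: log.append("hb-start"),
             lambda dev: log.append("hb-stop"),
             lambda dev, mnt: log.append("mount"))
```

Swapping the heartbeat callbacks for ones that talk to a CRM would leave the
structure unchanged, which is the analogy being drawn above.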

Joel

-- 

"In the long run...we'll all be dead."
                                        -Unknown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-20 10:03             ` Joel Becker
@ 2005-10-20 10:25               ` David Teigland
  2005-10-20 10:42                 ` Joel Becker
  0 siblings, 1 reply; 31+ messages in thread
From: David Teigland @ 2005-10-20 10:25 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Oct 20, 2005 at 08:03:41AM -0700, Joel Becker wrote:
> heavily involved in the cluster.  Mount.ocfs2 is a separate program, and
> in the O2CB world it handles starting the heartbeat for a particular
> device in the "local heartbeat" mode (by "local heartbeat", we mean our
> default mode of "must be heartbeating on the device that is the mounted
> filesystem").  So, there is never a kernel callout to ask the cluster
> manager to care.  We don't like kernel callouts :-)  It's all done in
> user from mount.ocfs2.  Having mount.ocfs2 know how to talk to CRM would
> be entirely analogous.

For some reason we've never used a mount.gfs so it didn't even cross my
mind -- it sounds like the obvious way to go.  That description I sent of
how gfs mount interacts with the cluster bits may be changing now...

Thanks,
Dave


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-20 10:25               ` David Teigland
@ 2005-10-20 10:42                 ` Joel Becker
  2005-10-20 10:45                   ` Lars Marowsky-Bree
  2005-10-21  4:09                   ` Christoph Hellwig
  0 siblings, 2 replies; 31+ messages in thread
From: Joel Becker @ 2005-10-20 10:42 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Oct 20, 2005 at 10:26:06AM -0500, David Teigland wrote:
> For some reason we've never used a mount.gfs so it didn't even cross my
> mind -- it sounds like the obvious way to go.  That description I sent of
> how gfs mount interacts with the cluster bits may be changing now...

	Hehe, when you said "mount process..." in your description, I
assumed mount.gfs :-)  You might want to rip off the
generic-mount-option-parsing code from mount.ocfs2 (which we ripped off
from mount.smb).

Joel

-- 

"And yet I fight,
 And yet I fight this battle all alone.
 No one to cry to;
 No place to call home."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-20 10:42                 ` Joel Becker
@ 2005-10-20 10:45                   ` Lars Marowsky-Bree
  2005-10-21  4:05                     ` Andrew Beekhof
  2005-10-21  4:09                   ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-20 10:45 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-20T08:42:44, Joel Becker <Joel.Becker@oracle.com> wrote:

> 	Hehe, when you said "mount process..." in your description, I
> assumed mount.gfs :-)  You might want to rip off the
> generic-mount-option-parsing code from mount.ocfs2 (which we ripped off
> from mount.smb).

So that settles how to resolve this. How boring! A solved problem. ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-20 10:45                   ` Lars Marowsky-Bree
@ 2005-10-21  4:05                     ` Andrew Beekhof
  2005-10-24  6:41                       ` Lars Marowsky-Bree
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Beekhof @ 2005-10-21  4:05 UTC (permalink / raw)
  To: ocfs2-devel


On Oct 20, 2005, at 5:45 PM, Lars Marowsky-Bree wrote:

> On 2005-10-20T08:42:44, Joel Becker <Joel.Becker@oracle.com> wrote:
>
>
>>     Hehe, when you said "mount process..." in your description, I
>> assumed mount.gfs :-)  You might want to rip off the
>> generic-mount-option-parsing code from mount.ocfs2 (which we  
>> ripped off
>> from mount.smb).
>>
>
> So that settles how to resolve this. How boring! A solved problem. ;-)
>

for future reference, in the context of hb2 the callout will have to  
either:
- create a placement constraint to put the new clone on the correct node
- increase the number of clones
- somehow figure out if the new clone would ever be started  
(otherwise it may block forever)
- wait for the new clone to be started

or, do nothing (if invoked by hb2 directly)
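
The "wait for the new clone to be started" step above carries the risk, noted in the list, of blocking forever. A minimal sketch of a bounded wait, in Python, under stated assumptions: `is_started` is a hypothetical caller-supplied probe (this thread does not show how hb2 exposes clone state), and the function name and parameters are illustrative only.

```python
import time


def wait_for_clone(is_started, timeout_s, poll_s=0.5,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll is_started() until it returns True or timeout_s elapses.

    is_started is a hypothetical zero-argument callable supplied by the
    caller; it stands in for whatever query the cluster manager offers.
    Returns True if the clone was seen started, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_started():
            return True
        if clock() >= deadline:
            # Give up instead of blocking forever; the caller decides
            # whether to fail the mount or retry.
            return False
        sleep(poll_s)
```

A callout built this way could fail the mount with a clear error when the clone never starts, rather than hanging.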

--
Andrew Beekhof

"If it weren't for my horse, I wouldn't have spent that year in  
college" - Unknown, courtesy of Lewis Black

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-20 10:42                 ` Joel Becker
  2005-10-20 10:45                   ` Lars Marowsky-Bree
@ 2005-10-21  4:09                   ` Christoph Hellwig
  2005-10-21  9:29                     ` Robert Wipfel
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2005-10-21  4:09 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Oct 20, 2005 at 08:42:44AM -0700, Joel Becker wrote:
> On Thu, Oct 20, 2005 at 10:26:06AM -0500, David Teigland wrote:
> > For some reason we've never used a mount.gfs so it didn't even cross my
> > mind -- it sounds like the obvious way to go.  That description I sent of
> > how gfs mount interacts with the cluster bits may be changing now...
> 
> 	Hehe, when you said "mount process..." in your description, I
> assumed mount.gfs :-)  You might want to rip off the
> generic-mount-option-parsing code from mount.ocfs2 (which we ripped off
> from mount.smb).

Heh.  Maybe you would be kind enough to rip it out from all of them
and submit a patch to create a libmount or similar to the util-linux
maintainer? ;-)
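
To illustrate what shared option-parsing routines of this kind might cover, here is a minimal Python sketch. It is not the code from mount.ocfs2 or mount.smb, and it is not a util-linux interface; the function name, the set of generic options, and the example fs-specific options are all assumptions made for illustration.

```python
# Generic options every mount helper would treat the same way.
# The exact membership of this set is an assumption for the sketch.
GENERIC_OPTIONS = {
    "ro", "rw", "sync", "async", "atime", "noatime",
    "dev", "nodev", "suid", "nosuid", "exec", "noexec",
}


def parse_mount_options(optstring):
    """Split a comma-separated -o string into generic and fs-specific parts.

    Returns (generic, fs_specific): generic is a list of bare flag names
    in the order given; fs_specific maps each remaining key to its value,
    or to None when the option has no "=value" part.
    """
    generic = []
    fs_specific = {}
    for opt in optstring.split(","):
        opt = opt.strip()
        if not opt:
            continue  # tolerate empty fields such as "ro,,noatime"
        key, sep, value = opt.partition("=")
        if key in GENERIC_OPTIONS and not sep:
            generic.append(key)
        else:
            fs_specific[key] = value if sep else None
    return generic, fs_specific
```

A helper would turn the generic list into mount flags itself and pass the fs-specific part through to the file system, for example `parse_mount_options("ro,noatime,cluster=demo")`, where `cluster=demo` is a made-up option.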

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-21  4:09                   ` Christoph Hellwig
@ 2005-10-21  9:29                     ` Robert Wipfel
  2005-11-06 23:01                       ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Robert Wipfel @ 2005-10-21  9:29 UTC (permalink / raw)
  To: ocfs2-devel

>>> On Fri, Oct 21, 2005 at 3:09 am, in message
<20051021090937.GA30904@lst.de>, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Oct 20, 2005 at 08:42:44AM -0700, Joel Becker wrote:
>> On Thu, Oct 20, 2005 at 10:26:06AM -0500, David Teigland wrote:
>> > For some reason we've never used a mount.gfs so it didn't even cross my
>> > mind -- it sounds like the obvious way to go.  That description I sent of
>> > how gfs mount interacts with the cluster bits may be changing now...
>>
>> 	Hehe, when you said "mount process..." in your description, I
>> assumed mount.gfs :-)  You might want to rip off the
>> generic-mount-option-parsing code from mount.ocfs2 (which we ripped off
>> from mount.smb).
>
> Heh.  Maybe you would be kind enough to rip it out from all of them
> and submit a patch to create a libmount or similar to the util-linux
> maintainer? ;-)

That would be nice - presumably it might also help solve the problem of
protecting non-cluster aware file systems too, that are today protected
against conflicting mounts by userspace conventions, that don't always
guard against an admin manually mounting a file system on one node that
might've been mounted by a cluster resource on some other node. When the
mount is issued, it would be nice to have some way to determine whether
to apply the extra cluster checks; for the case of a file system on
shared disk, versus do a normal mount for a file system on local disk.
For the cluster case, of a non-cluster aware file system, the mount could
take a per-fs global lock, and on conflicting nodes the lock would
prevent a mount (and maybe fsck too). Otoh, assuming a file system that's
really awesome at single-node operations, with latent potential to expand
into a cluster, hopefully this kind of protection would be built-in and
with all the tools aware of the need to respect concurrent (cluster)
access. Can the disk volume manager help - by protecting the file
system's block devices....
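
The per-fs lock idea above can be sketched as follows. This is only a single-host stand-in: it uses an advisory `flock` on a lock file, whereas the proposal needs a cluster-wide lock (for example through a DLM), which is not shown here. The lock directory, the use of a file system UUID as the lock name, and all function names are assumptions.

```python
import fcntl
import os


class MountLockHeld(Exception):
    """Raised when another holder already owns the per-fs lock."""


def acquire_fs_lock(lock_dir, fs_uuid):
    """Take an exclusive, non-blocking lock named after the file system.

    Returns the open file descriptor; keep it open for as long as the
    file system stays mounted. A real cluster implementation would
    replace the flock() call with a cluster-wide lock request.
    """
    path = os.path.join(lock_dir, fs_uuid + ".lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise MountLockHeld(fs_uuid)
    return fd


def release_fs_lock(fd):
    """Drop the lock taken by acquire_fs_lock (call after unmount)."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A mount helper, and possibly fsck, would call `acquire_fs_lock` before touching the device and refuse to proceed on `MountLockHeld`.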

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-21  4:05                     ` Andrew Beekhof
@ 2005-10-24  6:41                       ` Lars Marowsky-Bree
  2005-10-24  8:39                         ` Andrew Beekhof
  0 siblings, 1 reply; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-24  6:41 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-21T11:05:37, Andrew Beekhof <abeekhof@suse.de> wrote:

> for future reference, in the context of hb2 the callout will have to  
> either:
> - create a placement constraint to put the new clone on the correct node
> - increase the number of clones
> - somehow figure out if the new clone would ever be started  
> (otherwise it may block forever)

In that case the clone will still be orphaned/inactive after the cluster
has reached S_IDLE again.

> - wait for the new clone to be started
> 
> or, do nothing (if invoked by hb2 directly)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-24  6:41                       ` Lars Marowsky-Bree
@ 2005-10-24  8:39                         ` Andrew Beekhof
  0 siblings, 0 replies; 31+ messages in thread
From: Andrew Beekhof @ 2005-10-24  8:39 UTC (permalink / raw)
  To: ocfs2-devel


On Oct 24, 2005, at 1:41 PM, Lars Marowsky-Bree wrote:

> On 2005-10-21T11:05:37, Andrew Beekhof <abeekhof@suse.de> wrote:
>
>
>> for future reference, in the context of hb2 the callout will have to
>> either:
>> - create a placement constraint to put the new clone on the  
>> correct node
>> - increase the number of clones
>> - somehow figure out if the new clone would ever be started
>> (otherwise it may block forever)
>>
>
> In that case the clone will still be orphaned/inactive after the  
> cluster
> has reached S_IDLE again.
>

that's probably the most plausible way to do it.  though it relies on  
the cluster being otherwise stable which may not always be true.

>
>> - wait for the new clone to be started
>>
>> or, do nothing (if invoked by hb2 directly)
>>
>
>
> Sincerely,
>     Lars Marowsky-Brée <lmb@suse.de>
>
> -- 
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business     -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
>

--
Andrew Beekhof

"Eating fruit is mean and vicious... keep your hands off Golden  
Delicious" - TISM

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Ocfs2-devel] Re: [RFC] Integration with external clustering
  2005-10-18 16:52 [Ocfs2-devel] [RFC] Integration with external clustering Jeff Mahoney
  2005-10-18 17:18 ` Joel Becker
@ 2005-10-28 10:11 ` Lars Marowsky-Bree
  1 sibling, 0 replies; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-10-28 10:11 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-10-18T17:56:27, Jeff Mahoney <jeffm@suse.com> wrote:

Hi all,

just want to make sure this doesn't get lost. Where are we currently
at?

FYI, I'd like to ask for an additional way of documenting a suggested
approach: Please show how to set up a, say, 3-node "cluster" (statically)
and how to shut it down again - on the command line with shell scripts
;-) Hey, we're only operating on configfs/sysfs style "text files" and
directories, no? That should be possible.

Not only will it be a good basis for a regression test of the API, but
it'll also help us understand what the scripts for the Cluster Resource
Manager integration will have to look like and whether that's a
workable approach.
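
As a rough idea of what such a static setup and teardown could look like, here is a small Python sketch that only generates the shell commands and touches nothing. The directory layout (`cluster/<name>/node/<node>`) and the attribute names (`num`, `ipv4_address`, `ipv4_port`) are assumptions based on how the o2cb node manager's configfs tree is usually described; the cluster name, node names, addresses, and the configfs mount point are made up. Heartbeat regions and marking a node as local are left out.

```python
import os

# Assumed per-node attribute files; not verified against any kernel tree.
NODE_ATTRS = ("num", "ipv4_address", "ipv4_port")


def setup_plan(root, cluster, nodes):
    """Return (kind, path, value) steps that create a static cluster.

    nodes is a list of dicts with keys: name, num, ipv4_address, ipv4_port.
    """
    base = os.path.join(root, "cluster", cluster)
    steps = [("mkdir", base, None)]
    for node in nodes:
        node_dir = os.path.join(base, "node", node["name"])
        steps.append(("mkdir", node_dir, None))
        for attr in NODE_ATTRS:
            steps.append(("write", os.path.join(node_dir, attr),
                          str(node[attr])))
    return steps


def teardown_plan(root, cluster, nodes):
    """Return the steps that remove the nodes, then the cluster itself."""
    base = os.path.join(root, "cluster", cluster)
    steps = [("rmdir", os.path.join(base, "node", n["name"]), None)
             for n in reversed(nodes)]
    steps.append(("rmdir", base, None))
    return steps


def as_shell(steps):
    """Render steps as the shell commands an admin would type."""
    lines = []
    for kind, path, value in steps:
        if kind == "write":
            lines.append("echo %s > %s" % (value, path))
        else:
            lines.append("%s %s" % (kind, path))
    return lines


if __name__ == "__main__":
    demo_nodes = [
        {"name": "node%d" % i, "num": i,
         "ipv4_address": "192.168.0.%d" % (i + 1), "ipv4_port": 7777}
        for i in range(3)
    ]
    root = "/sys/kernel/config"  # assumed configfs mount point
    for line in as_shell(setup_plan(root, "demo", demo_nodes)):
        print(line)
    for line in as_shell(teardown_plan(root, "demo", demo_nodes)):
        print(line)
```

Emitting the commands rather than running them keeps the sketch usable as the basis of a regression test: the same plan can be compared against what a Cluster Resource Manager script actually does.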

Anybody thinking I'm on drugs? ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-10-21  9:29                     ` Robert Wipfel
@ 2005-11-06 23:01                       ` Christoph Hellwig
  2005-11-07  6:08                         ` Lars Marowsky-Bree
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2005-11-06 23:01 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Oct 21, 2005 at 08:28:50AM -0600, Robert Wipfel wrote:
> That would be nice - presumably it might also help solve the problem
> of protecting non-cluster aware file systems too, that are today
> protected against conflicting mounts by userspace conventions, that
> don't always guard against an admin manually mounting a file system on
> one node that might've been mounted by a cluster resource on some other
> node.

Umm, no.  Not at all :)  What I meant was routines for parsing mount
options so far.  When we find additional code useful for mount helpers
we can move it there as well, but I'm not interested in new functionality,
especially not such things.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Ocfs2-devel] [RFC] Integration with external clustering
  2005-11-06 23:01                       ` Christoph Hellwig
@ 2005-11-07  6:08                         ` Lars Marowsky-Bree
  0 siblings, 0 replies; 31+ messages in thread
From: Lars Marowsky-Bree @ 2005-11-07  6:08 UTC (permalink / raw)
  To: ocfs2-devel

On 2005-11-07T06:01:17, Christoph Hellwig <hch@lst.de> wrote:

> Umm, no.  Not at all :)  What I meant was routines for parsing mount
> options so far.  When we find additional code useful for mount helpers
> we can move it there as well, but I'm not interested in new functionality,
> especially not such things.

Well, given that it can be implemented via the same hooks entirely in
user-space, everyone can be happy.



-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2005-11-07  6:08 UTC | newest]

Thread overview: 31+ messages
2005-10-18 16:52 [Ocfs2-devel] [RFC] Integration with external clustering Jeff Mahoney
2005-10-18 17:18 ` Joel Becker
2005-10-18 18:03   ` Lars Marowsky-Bree
2005-10-18 18:27     ` Joel Becker
2005-10-18 18:50       ` Mark Fasheh
2005-10-19  8:26       ` Lars Marowsky-Bree
2005-10-19 12:49         ` Joel Becker
2005-10-19 17:41           ` Jeff Mahoney
2005-10-20  7:39             ` Lars Marowsky-Bree
2005-10-19 16:30         ` Jeff Mahoney
2005-10-20  5:24           ` Lars Marowsky-Bree
2005-10-20 10:03             ` Joel Becker
2005-10-20 10:25               ` David Teigland
2005-10-20 10:42                 ` Joel Becker
2005-10-20 10:45                   ` Lars Marowsky-Bree
2005-10-21  4:05                     ` Andrew Beekhof
2005-10-24  6:41                       ` Lars Marowsky-Bree
2005-10-24  8:39                         ` Andrew Beekhof
2005-10-21  4:09                   ` Christoph Hellwig
2005-10-21  9:29                     ` Robert Wipfel
2005-11-06 23:01                       ` Christoph Hellwig
2005-11-07  6:08                         ` Lars Marowsky-Bree
2005-10-20  6:04           ` Andrew Beekhof
2005-10-18 18:47     ` Mark Fasheh
2005-10-19  8:35       ` Lars Marowsky-Bree
2005-10-18 18:20   ` Jeff Mahoney
2005-10-19 14:57     ` Lars Marowsky-Bree
2005-10-19 17:42       ` David Teigland
2005-10-20  5:58         ` Lars Marowsky-Bree
2005-10-20  9:45           ` David Teigland
2005-10-28 10:11 ` [Ocfs2-devel] " Lars Marowsky-Bree
