* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
@ 2006-05-26 3:31 Daniel Phillips
2006-05-30 19:38 ` Lars Marowsky-Bree
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Phillips @ 2006-05-26 3:31 UTC (permalink / raw)
To: ocfs2-devel
Goals:
- Lightweight, kernel based fencing harness
- Support pluggable fencing methods
- Pluggable methods take policy out of kernel
- No reinvented wheels, use kernel modules
- Also accommodate user space fencing methods
- Divide work appropriately between kernel and user space
- Obey memory deadlock prevention rules
- Obey safe module unload rules
- Handle multiple clusters per node
Fencing is the act of preventing an incommunicado node from accessing shared
cluster storage. Currently, what OCFS2 calls fencing is really a watchdog
that panics an incommunicado node after a predetermined number of missed
heartbeats. This does prevent the incommunicado node from accessing shared
storage, but as a fencing scheme it has disadvantages:
1) The remaining nodes must wait at least as long as the watchdog timeout
before recovering any of the parted node's locks.
2) Panicking annoys cluster administrators, may take nodes offline for
unreasonably long periods, and is prone to endless panic cycles.
We can think of the existing watchdog scheme as one particular fencing method.
Most cluster configurations can support much better fencing methods. For
example, a storage network may support switch or server based IP address
banning. This proposal describes a modular framework that can accommodate
a wide variety of fencing schemes in a simple, robust and extensible way.
Relationship to Existing Watchdog
---------------------------------
The proposed fencing harness is independent of the existing watchdog, which
can continue to exist in its current form, though confusion would be
reduced by renaming it more accurately at this point. Eventually we
will want to parameterize the watchdog similarly to fencing, so that for
example an IP-banning fencing method can be paired with a watchdog method
that does not panic in the event of a network split. Even without
generalizing the watchdog methods we will still see an immediate benefit
from the new fencing harness in that the cluster will be able to recover
locks faster than the panic-based watchdog method.
To capture exactly the behavior of the existing watchdog, we may provide a
fencing method, call it "watchdog", that simply waits a predetermined time,
then reports success. During this wait the target node is presumed to have
fenced itself by panicking or otherwise. We might wish to implement a
"manual" fencing method, which might send a network message to some
administration address and wait to receive a reply. Since it is always
possible to implement an OCFS2-style watchdog and the limitations of
the watchdog method do not render it completely useless, we could make
the watchdog method the default if no other method is specified.
Registering a Fencing Method
----------------------------
Each fencing method is defined in a kernel module. A single module may
define more than one fencing method. In the module init, one or more
fencing methods will be registered with the OCFS2 cluster stack, giving
the name of the method, a function entry to invoke the method and the
module owner.
Something like:
err = node_register_fence_method(name, fn, owner);
Providing that no method of the same name is already registered, the method
will be added to a static list of available methods. We need to remember
the owner module so that the module can be locked into the kernel whenever
the fencing method could possibly be invoked.
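A minimal user-space sketch of the registration step described above, assuming
a fixed-size table in place of a kernel list and a plain pointer in place of
the module owner. Only node_register_fence_method(name, fn, owner) comes from
the proposal; fence_method_lookup, MAX_FENCE_METHODS, the struct layout and
the watchdog_initiate example method are hypothetical names for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_FENCE_METHODS 16    /* arbitrary bound for this sketch */

struct fence_node;              /* opaque in this sketch */
typedef int (*fence_fn)(struct fence_node *target);

struct fence_method {
    const char *name;           /* unique method name, e.g. "watchdog" */
    fence_fn    initiate;       /* entry point that starts fencing */
    const void *owner;          /* stand-in for the owning module */
};

static struct fence_method fence_methods[MAX_FENCE_METHODS];
static size_t fence_method_count;

/* Find a registered method by name, or return NULL. */
static struct fence_method *fence_method_lookup(const char *name)
{
    for (size_t i = 0; i < fence_method_count; i++)
        if (strcmp(fence_methods[i].name, name) == 0)
            return &fence_methods[i];
    return NULL;
}

/* Register a method; fails if the name is taken or the table is full. */
int node_register_fence_method(const char *name, fence_fn fn, const void *owner)
{
    assert(name != NULL && fn != NULL);
    if (fence_method_lookup(name) != NULL)
        return -EEXIST;
    if (fence_method_count == MAX_FENCE_METHODS)
        return -ENOSPC;
    fence_methods[fence_method_count++] =
        (struct fence_method){ name, fn, owner };
    return 0;
}

/* Trivial example method: reports that fencing was initiated. */
static int watchdog_initiate(struct fence_node *target)
{
    (void)target;
    return 0;
}
```

In a real module the owner would be the module itself, and the harness would
pin it (for example with a module reference) for as long as the method can be
invoked.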
Normally, each node of an OCFS2 cluster will load the same fencing methods.
We could in theory relax this if we do not require every node to be able to
carry out fencing. For now it is simpler to assume every node can possibly
fence other nodes.
Associating Nodes with Fencing Methods
--------------------------------------
The user space tools have available a global configuration file that
enumerates all the nodes that can possibly join the cluster. For each
node we supply a configuration line that states the name of the fencing
method to be used for that node. We may also state other details such as
the period to allow for a watchdog method.
The user space tools parse the configuration file into a digestible form
for the kernel components and pass it to the kernel in whatever
format the userspace tools and fencing methods agree between themselves.
This information will be available internally to fencing methods that
need to know how to perform configuration-specific actions. For the
time being we do not need to worry about stabilizing this format because
we can require that the user tools exactly match the kernel module used.
The node manager checks that every fencing method mentioned in the
configuration file is already registered, otherwise the node might not
be able to fulfill its duty if it is called upon to fence another node.
If the node cannot handle every fencing method used by any node, the
join attempt will fail.
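The join-time check can be sketched as follows. This is only an illustration
under stated assumptions: struct node_config, method_registered and check_join
are hypothetical names, and the list of registered methods is passed in as a
plain array of names rather than read from the harness.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one parsed configuration line per node. */
struct node_config {
    int         node_num;
    const char *fence_method;   /* method name from the config line */
};

/* True if name appears among the registered method names. */
static bool method_registered(const char *const *registered, size_t nreg,
                              const char *name)
{
    for (size_t i = 0; i < nreg; i++)
        if (strcmp(registered[i], name) == 0)
            return true;
    return false;
}

/*
 * Return 0 if this node can handle the fencing method of every configured
 * node, or -ENOENT if any method is missing, in which case the join fails.
 */
int check_join(const struct node_config *cfg, size_t ncfg,
               const char *const *registered, size_t nreg)
{
    for (size_t i = 0; i < ncfg; i++) {
        assert(cfg[i].fence_method != NULL);
        if (!method_registered(registered, nreg, cfg[i].fence_method))
            return -ENOENT;
    }
    return 0;
}
```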
Up to this point, there is no requirement to obey memory deadlock rules
because no cluster filesystem can yet be mounted. This means that the
above steps can be executed in user space if we wish, with the exception
of filling in the kernel node structures. However, there is not very
much code required and a user space linkage might well outweigh any
kernel code savings. For now it is easiest to do in kernel.
After the node has joined the cluster it will begin to receive membership
events to inform it which other nodes belong to the cluster. For each
other node in the cluster the node manager creates a node structure and
fills in the node's fencing method entry point by looking up the named
fencing method in the list of registered fencing methods. We can at
this point also add a pointer to any configuration details specified on
the node's fence configuration line.
As soon as our node has fully joined the cluster a mount could possibly
take place, so memory deadlock rules come into play.
Note: my description of node join events may not match exactly the way
OCFS2 does it at this point.
Invoking a Fencing Method
-------------------------
For sanity's sake, only one node on the cluster will have the duty of
initiating fencing. For simplicity, we can let that be the heartbeat
node, or in OCFS2 terms, the lowest numbered node in the cluster.
Heartbeat reports to the node manager that a node needs to be fenced.
The node manager invokes fencing with a call like:
err = target_node->fence->initiate(target_node);
A zero error result means that fencing has been initiated. The fence
method reports completion asynchronously by sending a message to the node
manager, something like:
write(thisnode->nodeman->socket, {FENCED, thisnode, errno, errmsg}, len);
A zero errno means that fencing was successful and the errmsg is empty.
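One possible shape for the completion message, as a sketch only. The proposal
gives just the tuple {FENCED, thisnode, errno, errmsg}; the struct layout,
the field widths, the 64-byte message buffer and the helper names below are
assumptions, as is reading "thisnode" as the node the message is about.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

enum nodeman_msg_type { NODEMAN_FENCED = 1 };   /* hypothetical value */

struct fenced_msg {
    uint32_t type;          /* NODEMAN_FENCED */
    uint32_t node;          /* node number the result refers to */
    int32_t  err;           /* 0 on success, otherwise an errno value */
    char     errmsg[64];    /* empty on success, NUL terminated */
};

/* Fill in a completion message; errmsg is ignored when err is zero. */
void fenced_msg_init(struct fenced_msg *m, uint32_t node, int err,
                     const char *errmsg)
{
    memset(m, 0, sizeof(*m));
    m->type = NODEMAN_FENCED;
    m->node = node;
    m->err = err;
    if (err != 0 && errmsg != NULL)
        strncpy(m->errmsg, errmsg, sizeof(m->errmsg) - 1);
}

/* Success means a zero error code and an empty error message. */
int fenced_msg_ok(const struct fenced_msg *m)
{
    return m->type == NODEMAN_FENCED && m->err == 0 && m->errmsg[0] == '\0';
}
```

The fence method would then write sizeof(struct fenced_msg) bytes to the node
manager's socket once fencing has finished or failed.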
As long as there are any fencing operations in progress the module that
owns the method may not be removed. An easy way to implement this is to
prevent the node from leaving the cluster until outstanding fencing
operations have completed. This in turn is accomplished by incrementing
a counter before fencing is initiated and decrementing it when the fence
result message is received or if initiating fails. After the node
leaves the cluster it decrements the module count for every fence method
that it originally incremented, allowing the module to be unloaded if no
other cluster is using any of the fence methods.
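The counting rule above could look roughly like this. It is a single-threaded
sketch with hypothetical names (struct nodeman, nodeman_fence and so on); a
kernel version would need an atomic counter or a lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct fence_target;    /* opaque in this sketch */
typedef int (*fence_initiate_fn)(struct fence_target *t);

struct nodeman {
    int fences_in_flight;   /* initiated but not yet reported */
};

/* Count the operation before initiating; undo the count if that fails. */
int nodeman_fence(struct nodeman *nm, fence_initiate_fn initiate,
                  struct fence_target *t)
{
    int err;

    nm->fences_in_flight++;
    err = initiate(t);
    if (err != 0)
        nm->fences_in_flight--;
    return err;
}

/* Called when the asynchronous fence result message arrives. */
void nodeman_fence_result(struct nodeman *nm)
{
    assert(nm->fences_in_flight > 0);
    nm->fences_in_flight--;
}

/* The node may leave the cluster only when nothing is outstanding. */
bool nodeman_may_leave(const struct nodeman *nm)
{
    return nm->fences_in_flight == 0;
}

/* Example initiators used to exercise both paths. */
static int initiate_ok(struct fence_target *t)   { (void)t; return 0; }
static int initiate_fail(struct fence_target *t) { (void)t; return -EIO; }
```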
User Space Fencing Methods
--------------------------
Fencing may be implemented in userspace, however a module must be
written to implement the linkage. Most likely, user space fencing will
take the form of a memlocked daemon that communicates with the kernel
module using a socket, which would be opened at module initialization
time or alternatively (and with some additional kernel support) at
node bringup time. Userspace fencing methods must obey memory deadlock
prevention rules. This is hard, so maybe we should get the kernel
based methods working first.
Regards,
Daniel
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-05-26 3:31 [Ocfs2-devel] [RFC] Fencing harness for OCFS2 Daniel Phillips
@ 2006-05-30 19:38 ` Lars Marowsky-Bree
2006-05-30 19:42 ` Jeff Mahoney
2006-05-30 19:45 ` Daniel Phillips
0 siblings, 2 replies; 11+ messages in thread
From: Lars Marowsky-Bree @ 2006-05-30 19:38 UTC (permalink / raw)
To: ocfs2-devel

On 2006-05-25T20:31:33, Daniel Phillips <phillips@google.com> wrote:

> Goals:
> - Lightweight, kernel based fencing harness
> - Support pluggable fencing methods
> - Pluggable methods take policy out of kernel
> - No reinvented wheels, use kernel modules
> - Also accommodate user space fencing methods
> - Divide work appropriately between kernel and user space
> - Obey memory deadlock prevention rules
> - Obey safe module unload rules
> - Handle multiple clusters per node

Sorry we're chiming in so late, but with Jeff's user-space membership
patches, we have user-space driven fencing working with heartbeat 2.

Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-05-30 19:38 ` Lars Marowsky-Bree
@ 2006-05-30 19:42 ` Jeff Mahoney
2006-05-30 20:15 ` Daniel Phillips
2006-05-30 19:45 ` Daniel Phillips
1 sibling, 1 reply; 11+ messages in thread
From: Jeff Mahoney @ 2006-05-30 19:42 UTC (permalink / raw)
To: ocfs2-devel

Lars Marowsky-Bree wrote:
> On 2006-05-25T20:31:33, Daniel Phillips <phillips@google.com> wrote:
>
>> Goals:
>> - Lightweight, kernel based fencing harness
>> - Support pluggable fencing methods
>> - Pluggable methods take policy out of kernel
>> - No reinvented wheels, use kernel modules
>> - Also accommodate user space fencing methods
>> - Divide work appropriately between kernel and user space
>> - Obey memory deadlock prevention rules
>> - Obey safe module unload rules
>> - Handle multiple clusters per node
>
> Sorry we're chiming in so late, but with Jeff's user-space membership
> patches, we have user-space driven fencing working with heartbeat 2.

It works, but the "Obey memory deadlock prevention rules" line item is
still an issue.

-Jeff

--
Jeff Mahoney
SUSE Labs
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-05-30 19:42 ` Jeff Mahoney
@ 2006-05-30 20:15 ` Daniel Phillips
2006-05-31 0:19 ` Lars Marowsky-Bree
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Phillips @ 2006-05-30 20:15 UTC (permalink / raw)
To: ocfs2-devel

Jeff Mahoney wrote:
> Lars Marowsky-Bree wrote:
>
>>On 2006-05-25T20:31:33, Daniel Phillips <phillips@google.com> wrote:
>>
>>>Goals:
>>> - Lightweight, kernel based fencing harness
>>> - Support pluggable fencing methods
>>> - Pluggable methods take policy out of kernel
>>> - No reinvented wheels, use kernel modules
>>> - Also accommodate user space fencing methods
>>> - Divide work appropriately between kernel and user space
>>> - Obey memory deadlock prevention rules
>>> - Obey safe module unload rules
>>> - Handle multiple clusters per node
>>
>>Sorry we're chiming in so late, but with Jeff's user-space membership
>>patches, we have user-space driven fencing working with heartbeat 2.
>
> It works, but the "Obey memory deadlock prevention rules" line item is
> still an issue.

As I would expect. To be sure, I am interested in hooking up Linux-HA
properly to OCFS2, but what we need to do is to place the core of fencing
in the kernel where it is easiest to implement anti-deadlock measures,
then export an API to Linux-HA. This will be easy with the module-based
API I have proposed, in fact I would be happy to prototype a module to do
it.

But fencing is only part of the story. The whole list of cluster manager
components that can execute in the block writeout path and therefore need
to obey memory deadlock rules is:

 * Heartbeat
 * Fencing
 * Membership and node status events
 * Service takeover for essential services (including DLM recovery)
 * Node addressing and messaging required for the above

I think that is the whole list, if I have missed anything somebody please
shout. Each of these components needs to get a treatment similar to what
I have proposed for fencing. For example, we need a pluggable API for
service takeover, which I am drafting now. If anybody really doesn't
like my proposal for a fencing harness, please speak up now because the
proposal for service takeover will be very similar.

Regards,

Daniel
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-05-30 20:15 ` Daniel Phillips
@ 2006-05-31 0:19 ` Lars Marowsky-Bree
2006-05-31 18:38 ` Daniel Phillips
0 siblings, 1 reply; 11+ messages in thread
From: Lars Marowsky-Bree @ 2006-05-31 0:19 UTC (permalink / raw)
To: ocfs2-devel

On 2006-05-30T13:15:19, Daniel Phillips <phillips@google.com> wrote:

> As I would expect. To be sure, I am interested in hooking up Linux-HA
> properly to OCFS2, but what we need to do is to place the core of fencing
> in the kernel where it is easiest to implement anti-deadlock measures,

This is not sufficient, though. The piece making the policy decision to
fence also needs to be protected, as you note later:

> then export an API to Linux-HA. This will be easy with the module-based
> API I have proposed, in fact I would be happy to prototype a module to do
> it.
>
> But fencing is only part of the story. The whole list of cluster manager
> components that can execute in the block writeout path and therefore need
> to obey memory deadlock rules is:
>
> * Heartbeat
> * Fencing
> * Membership and node status events
> * Service takeover for essential services (including DLM recovery)
> * Node addressing and messaging required for the above
>
> I think that is the whole list, if I have missed anything somebody please
> shout. Each of these components needs to get a treatment similar to what
> I have proposed for fencing. For example, we need a pluggable API for
> service takeover, which I am drafting now. If anybody really doesn't
> like my proposal for a fencing harness, please speak up now because the
> proposal for service takeover will be very similar.

You can't really wish to place all of these into kernel space. This is
exactly what we're moving away from.

I've not been very good at following the list. How do you protect
against memory inversion - reliably? It's hard enough _within_ the
kernel.

Many parts of heartbeat do, in fact, take great pains to not cause any
paging etc, yet it's very hard to guarantee this for the entire stack
from low-level networking up to high-level policy decisions.

It's not as much of a problem if you're not trying to run / on a CFS,
though, or if at least you have local swap (to which, in theory, you
could swap out writes ultimately destined for the CFS). And, of course,
if one node deadlocks on memory inversion, the others are going to fence
it from the outside.

I know we've had this discussion for years, but I don't remember _ever_
seeing a solution.

Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-05-31 0:19 ` Lars Marowsky-Bree
@ 2006-05-31 18:38 ` Daniel Phillips
2006-05-31 23:03 ` Lars Marowsky-Bree
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Phillips @ 2006-05-31 18:38 UTC (permalink / raw)
To: ocfs2-devel

Lars Marowsky-Bree wrote:
> On 2006-05-30T13:15:19, Daniel Phillips <phillips@google.com> wrote:
>
>>As I would expect. To be sure, I am interested in hooking up Linux-HA
>>properly to OCFS2, but what we need to do is to place the core of fencing
>>in the kernel where it is easiest to implement anti-deadlock measures,
>
> This is not sufficient, though. The piece making the policy decision to
> fence also needs to be protected, as you note later:

That is not quite accurate. Fencing modules as I define them do not
encapsulate policy, they provide mechanism through which the cluster
administrator or user space tools can implement policy. Policy is
implemented by plugging in the correct methods and parameterizing the
methods.

So my fencing harness proposal does _not_ put policy in kernel, only
mechanism.

>>But fencing is only part of the story. The whole list of cluster manager
>>components that can execute in the block writeout path and therefore need
>>to obey memory deadlock rules is:
>>
>> * Heartbeat
>> * Fencing
>> * Membership and node status events
>> * Service takeover for essential services (including DLM recovery)
>> * Node addressing and messaging required for the above
>>
>>I think that is the whole list, if I have missed anything somebody please
>>shout. Each of these components needs to get a treatment similar to what
>>I have proposed for fencing. For example, we need a pluggable API for
>>service takeover, which I am drafting now. If anybody really doesn't
>>like my proposal for a fencing harness, please speak up now because the
>>proposal for service takeover will be very similar.
>
> You can't really wish to place all of these into kernel space. This is
> exactly what we're moving away from.

Who is moving away from? OCFS2 team guys have toyed with the idea but
hopefully understand why it's a bad idea by now. If not then maybe I
need to adjust the violence setting on my larting tool.

Anyway, I erred in not mentioning above that only a small core of each
service has to stay in kernel or otherwise implement memory deadlock
avoidance measures. Most of the bulky code can indeed go to userspace
without any special measures, just not anything that has to execute
in the block write path.

So yes, I intend to place all of the above in kernel space. OCFS2 has
them there now anyway, though not in a sufficiently general form. I
promise that by the time I am done OCFS2's kernel cluster stack code
will be significantly smaller and more general at the same time, and
you will be able to do all the fancy things you're doing now, except
drive those essential services from user space.

If you think you can solve the problem accurately with less kernel
code then please please show me how, but don't forget that _all_
fencing code that can execute in the block write path must obey the
memory deadlock rules.

> I've not been very good at following the list. How do you protect
> against memory inversion - reliably? It's hard enough _within_ the
> kernel.

To write a userspace daemon capable of executing reliably in the block
writeout path:

 1. Preallocate all working memory for the daemon including stack, then
    memlock everything

 2. Must not load any libraries, exec any program or script, or
    otherwise do anything that can't be statically analyzed for bounded
    memory usage

 3. Must throttle service traffic so that a static bound can be placed
    on memory usage (special case of 1.).

 4. Do all of this not only for nodes actually running services for
    the cluster filesystem, but also for any cluster filesystem node
    the service can fail over to.

 5. Audit all syscalls to ensure no memory is used. Since this is
    generally impossible, we need a special hack to run the user
    space daemon in PF_MEMALLOC mode (yes this works)

> Many parts of heartbeat do, in fact, take great pains to not cause any
> paging etc, yet it's very hard to guarantee this for the entire stack
> from low-level networking up to high-level policy decisions.

Yes. Much easier to code the bits the block IO path actually needs in
kernel. You can in theory accomplish this in user space as above, but
why anybody would want to go through that pain to end up with a fragile,
bulky and arguably unmaintainable solution is not clear to me.

> It's not as much of a problem if you're not trying to run / on a CFS,
> though, or if at least you have local swap (to which, in theory, you
> could swap out writes ultimately destined for the CFS). And, of course,
> if one node deadlocks on memory inversion, the others are going to fence
> it from the outside.

As with pregnancy, there is no such thing as a little bit deadlocked.
You can hit the memory deadlock even on a non-root partition. All you
have to do is write to it, swapping will trigger it more easily but
writing can still trigger it.

> I know we've had this discussion for years, but I don't remember _ever_
> seeing a solution.

I never got around to spelling out the easy bits, just the hard part
involving receiving network packets. See here, and the series of improved
patches that followed:

http://lwn.net/Articles/146061/?format=printable

Avoiding the deadlock in kernel (except for the network receive) is
generally pretty easy. Just make sure that any kernel daemon that can
execute in the block write path runs in PF_MEMALLOC mode and is accurately
throttled, a subset of the user space requirements.

Regards,

Daniel
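Step 1 of the userspace daemon requirements listed in the message above
(preallocate, then memlock) might be sketched like this, assuming a static
pool with a bump allocator in place of malloc. POOL_SIZE, pool_alloc and
daemon_lock_memory are hypothetical names; mlockall() is the standard POSIX
call. The sketch does not cover throttling, the syscall audit or the
PF_MEMALLOC kernel hack, and mlockall() can fail without sufficient privilege
or RLIMIT_MEMLOCK, so the caller has to check its return value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_SIZE (1u << 20)    /* 1 MiB of working memory, sized up front */

static unsigned char pool[POOL_SIZE];
static size_t pool_used;

/*
 * Call once at startup, before serving requests: touch every page of the
 * pool so it is resident, then lock current and future mappings.
 * Returns 0 on success or -1 with errno set, as mlockall() does.
 */
int daemon_lock_memory(void)
{
    memset(pool, 0, sizeof(pool));
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/*
 * Bump allocator over the locked pool: never calls into the kernel and
 * never grows, so memory use stays within a static bound. Returns NULL
 * when the pool is exhausted. Allocations are never freed.
 */
void *pool_alloc(size_t n)
{
    void *p;

    if (n > POOL_SIZE)
        return NULL;
    n = (n + 15) & ~(size_t)15;     /* keep 16-byte alignment */
    if (n > POOL_SIZE - pool_used)
        return NULL;
    p = pool + pool_used;
    pool_used += n;
    return p;
}
```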
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-05-31 18:38 ` Daniel Phillips
@ 2006-05-31 23:03 ` Lars Marowsky-Bree
2006-06-01 0:57 ` Daniel Phillips
0 siblings, 1 reply; 11+ messages in thread
From: Lars Marowsky-Bree @ 2006-05-31 23:03 UTC (permalink / raw)
To: ocfs2-devel

On 2006-05-31T11:38:07, Daniel Phillips <phillips@google.com> wrote:

> >>As I would expect. To be sure, I am interested in hooking up Linux-HA
> >>properly to OCFS2, but what we need to do is to place the core of fencing
> >>in the kernel where it is easiest to implement anti-deadlock measures,
> > This is not sufficient, though. The piece making the policy decision to
> > fence also needs to be protected, as you note later:
> That is not quite accurate. Fencing modules as I define them do not
> encapsulate policy, they provide mechanism through which the cluster
> administrator or user space tools can implement policy. Policy is
> implemented by plugging in the correct methods and parameterizing the
> methods.
>
> So my fencing harness proposal does _not_ put policy in kernel, only
> mechanism.

That is not an accurate interpretation of what I said, or at least, what
I meant to say ;-)

I meant to say that having the fencing alone in the kernel isn't
sufficient; you also need to have the processes which ultimately
conclude that fencing of node(s) $foo is required protected.

> Who is moving away from? OCFS2 team guys have toyed with the idea but
> hopefully understand why it's a bad idea by now. If not then maybe I
> need to adjust the violence setting on my larting tool.

It's what's been discussed at all the events I've been to in the last
couple of years, one or two even organized by you, as I recall ;-)

Fencing needs to talk to all sorts of nasty devices, something really
well handled by user-space and some expect scripts.

> Anyway, I erred in not mentioning above that only a small core of each
> service has to stay in kernel or otherwise implement memory deadlock
> avoidance measures. Most of the bulky code can indeed go to userspace
> without any special measures, just not anything that has to execute
> in the block write path.

That much is true, but aren't the things which provide (policy) input to
the block write path also crucial? What does it help you that, in
theory, you could fence, but never reach that conclusion?

> So yes, I intend to place all of the above in kernel space. OCFS2 has
> them there now anyway, though not in a sufficiently general form. I
> promise that by the time I am done OCFS2's kernel cluster stack code
> will be significantly smaller and more general at the same time, and
> you will be able to do all the fancy things you're doing now, except
> drive those essential services from user space.

The problem is your intent is to keep membership and heartbeating and
quorum computation within the kernel. I don't think that's a good idea.
They can be readily performed in user-space, and even quite protected
against memory deadlocks. Same for fencing. Except for the bits which
require in-kernel memory from user-space, but as you say, they can be
protected with PF_MEMALLOC.

However, the more processes we gain digging into PF_MEMALLOC means this
reserve becomes more precious too, and they can interfere with
each other, if they all need their memory at the same time.

(I thought the idea of per-process-mempools has been discussed in the
past, but was met with lots of "You do it!" remarks.)

Another key point to keep in mind is that, looking at the larger
picture, too many things _already_ have been moved out of the kernel
(and are unlikely to go back) to make it feasible to solve the problem
only there.

Think MPIO, which is driven via sysfs and other things with a user-space
daemon; if you're under memory pressure, the system won't be able to
recover the paths to your cluster filesystem, what good does it do you
that your CFS fencing works? How about iscsi?

> If you think you can solve the problem accurately with less kernel
> code then please please show me how, but don't forget that _all_
> fencing code that can execute in the block write path must obey the
> memory deadlock rules.

Well, the heartbeat fencing modules ain't called from the block write
path; after we've fenced a node, we signal this to the kernel (by
cleaning up the links to it in OCFS2's configfs directory).

Now of course, if we're deadlocked, we might never get that far. Sigh.
Life sucks.

> To write a userspace daemon capable of executing reliably in the block
> writeout path:
>
> 1. Preallocate all working memory for the daemon including stack, then
>    memlock everything
>
> 2. Must not load any libraries, exec any program or script, or
>    otherwise do anything that can't be statically analyzed for bounded
>    memory usage
>
> 3. Must throttle service traffic so that a static bound can be placed
>    on memory usage (special case of 1.).
>
> 4. Do all of this not only for nodes actually running services for
>    the cluster filesystem, but also for any cluster filesystem node
>    the service can fail over to.
>
> 5. Audit all syscalls to ensure no memory is used. Since this is
>    generally impossible, we need a special hack to run the user
>    space daemon in PF_MEMALLOC mode (yes this works)

Right, heartbeat does most of that. We've got that down pretty well, I
hope. Non-blocking bounded IPC for our processes, non-blocking logging
(syslog is a bitch!) etc...

> > It's not as much of a problem if you're not trying to run / on a CFS,
> > though, or if at least you have local swap (to which, in theory, you
> > could swap out writes ultimately destined for the CFS). And, of course,
> > if one node deadlocks on memory inversion, the others are going to fence
> > it from the outside.
> As with pregnancy, there is no such thing as a little bit deadlocked.
> You can hit the memory deadlock even on a non-root partition. All you
> have to do is write to it, swapping will trigger it more easily but
> writing can still trigger it.

Ah, you didn't read what I wrote. Read again. When we can free up memory
for our "dirty" user-space stack by pushing out writes destined to the
CFS to the local swap, so we can fence etc and all that. That's an idea
I've been toying around with lately.

Of course, when we're out of virtual memory too, that won't work either.

But, I remain convinced that what we need is a general solution for the
user-space side, because we're so badly dependent on it already, instead
of trying to hold together the dam inside the kernel ;-)

Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-05-31 23:03 ` Lars Marowsky-Bree
@ 2006-06-01 0:57 ` Daniel Phillips
2006-06-01 10:13 ` Lars Marowsky-Bree
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Phillips @ 2006-06-01 0:57 UTC (permalink / raw)
To: ocfs2-devel

Lars Marowsky-Bree wrote:
> I meant to say that having the fencing alone in the kernel isn't
> sufficient; you also need to have the processes which ultimately
> conclude that fencing of node(s) $foo is required protected.

Yes. From my earlier post, those processes are:

 * Heartbeat
 * Membership and node status events
 * Node addressing and messaging required for the above

and for completeness, the two others that need to be in kernel but aren't
involved in the decision to fence are:

 * Fencing
 * Service takeover for essential services (including DLM recovery)

Deciding whether a node needs to be fenced is pretty easy. If it missed
X heartbeats, fence it.

>>Who is moving away from? OCFS2 team guys have toyed with the idea but
>>hopefully understand why it's a bad idea by now. If not then maybe I
>>need to adjust the violence setting on my larting tool.
>
> It's what's been discussed at all the events I've been to in the last
> couple of years, one or two even organized by you, as I recall ;-)

I remember it well, I also kept my mouth shut as far as possible while
flights of fancy involving virtual filesystems and other metamagical
wondery floated around in the crack pipe induced haze. To be sure, there
were some sensible suggestions put forth also, but that wasn't one of
them.

> Fencing needs to talk to all sorts of nasty devices, something really
> well handled by user-space and some expect scripts.

If fencing itself can deadlock, it does not meet my definition of "well
handled".

Now suppose you have a device that really does require more code than
anyone in good conscience would write in a kernel module, or that device
absolutely must be fenced by a combination of bash and perl for some
unfathomable reason. Then (copying from an earlier post) your options
are:

 1) A kernel fencing method sends messages to a dedicated fencing node
    that does not mount the filesystem. This may waste a node and needs
    some additional mechanism to avoid becoming a single point of
    failure.

 2) A kernel fencing method sends messages to a userspace program
    written in C, memlocked, and running in (slight kernel hack here)
    PF_MEMALLOC mode. This might require a little more work than a Perl
    script, but then real men enjoy work.

 3) A kernel fencing method sends messages to a userspace program
    running in a resource sandbox (e.g. UML or XEN) that does whatever
    it wants to. This is really buzzword compatible, really wasteful,
    and a great use of administration time.

And using the fencing harness I described we can write one generic kernel
module that is parameterized so that it can invoke any possible userspace
fencing method. See how hard I work for you? Sometimes it seems this
effort just goes unappreciated.

>>Anyway, I erred in not mentioning above that only a small core of each
>>service has to stay in kernel or otherwise implement memory deadlock
>>avoidance measures. Most of the bulky code can indeed go to userspace
>>without any special measures, just not anything that has to execute
>>in the block write path.
>
> That much is true, but aren't the things which provide (policy) input to
> the block write path also crucial? What does it help you that, in
> theory, you could fence, but never reach that conclusion?

How much policy is involved in "missed X heartbeats => fence it"? I see
exactly two policy inflection points: the value of X and the period of a
heartbeat. These are both to be supplied as parameters from user space at
node bringup time.
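The "missed X heartbeats => fence it" rule with its two parameters could be
written as below. This is an illustrative sketch, not code from OCFS2; struct
hb_policy, hb_missed and hb_needs_fencing are hypothetical names, and time is
taken as a millisecond counter supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The two policy parameters, supplied from user space at node bringup. */
struct hb_policy {
    uint32_t period_ms;     /* heartbeat period */
    uint32_t max_missed;    /* X: missed heartbeats before fencing */
};

/* Number of whole heartbeat periods elapsed since the node was last seen. */
uint64_t hb_missed(const struct hb_policy *p, uint64_t last_seen_ms,
                   uint64_t now_ms)
{
    assert(p->period_ms > 0);
    if (now_ms <= last_seen_ms)
        return 0;
    return (now_ms - last_seen_ms) / p->period_ms;
}

/* Mechanism only: fence once X periods have passed without a heartbeat. */
bool hb_needs_fencing(const struct hb_policy *p, uint64_t last_seen_ms,
                      uint64_t now_ms)
{
    return hb_missed(p, last_seen_ms, now_ms) >= p->max_missed;
}
```

With a 2000 ms period and X = 3, a node last seen at t = 1000 ms becomes a
fencing candidate at t = 7000 ms.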

To be fair I haven't described the heartbeat harness yet. But it will look
a lot like the fencing harness, including having pluggable heartbeat
methods, so you can easily plug in your userspace heartbeat instead of
kernel heartbeat if you want to.

>>So yes, I intend to place all of the above in kernel space. OCFS2 has
>>them there now anyway, though not in a sufficiently general form. I
>>promise that by the time I am done OCFS2's kernel cluster stack code
>>will be significantly smaller and more general at the same time, and
>>you will be able to do all the fancy things you're doing now, except
>>drive those essential services from user space.
>
> The problem is your intent is to keep membership and heartbeating and
> quorum computation within the kernel. I don't think that's a good idea.
> They can be readily performed in user-space, and even quite protected
> against memory deadlocks. Same for fencing. Except for the bits which
> require in-kernel memory from user-space, but as you say, they can be
> protected with PF_MEMALLOC.

I will buy your argument if the required kernel code is more than a few
hundred lines, otherwise I don't agree, it's more important to be obviously
correct.

> However, the more processes we gain digging into PF_MEMALLOC means this
> reserve becomes more precious too, and they can interfere with
> each other, if they all need their memory at the same time.

Fortunately I anticipated that. You will see in the anti memory deadlock
patches I posted this summer there is a mechanism for resizing the
PF_MEMALLOC reserve as necessary, suitable for use at module init and exit
time.

> (I thought the idea of per-process-mempools has been discussed in the
> past, but was met with lots of "You do it!" remarks.)

I considered using mempools for this but I didn't like the way the code got
ugly fast. Using the PF_MEMALLOC reserve has the nice feature that it just
works for userspace, provided the userspace code is properly throttled of
course.

> Another key point to keep in mind is that, looking at the larger
> picture, too many things _already_ have been moved out of the kernel
> (and are unlikely to go back) to make it feasible to solve the problem
> only there.

I think I know what you're talking about, you think that there will be no
nice interface from the kernel cluster components to userspace. Rest
assured that there will be one of those and it will be a nice one, and
simple too. Furthermore, because you have that pluggable harness, you can
just invent your own and plug it in.

> Think MPIO, which is driven via sysfs and other things with a user-space
> daemon; if you're under memory pressure, the system won't be able to
> recover the paths to your cluster filesystem, what good does it do you
> that your CFS fencing works? How about iscsi?

I am not going to lose a whole lot of sleep over MPIO wankery. If you need
it, plug it in. Maybe one day we will get around to re-engineering it so
it works properly.

>>If you think you can solve the problem accurately with less kernel
>>code then please please show me how, but don't forget that _all_
>>fencing code that can execute in the block write path must obey the
>>memory deadlock rules.
>
> Well, the heartbeat fencing modules ain't called from the block write
> path; after we've fenced a node, we signal this to the kernel (by
> cleaning up the links to it in OCFS2's configfs directory).
>
> Now of course, if we're deadlocked, we might never get that far. Sigh.
> Life sucks.

Exactly. I never said "called from", I said "executes in the path of".
Fencing executes in the path of block IO in the sense that if the fencing
doesn't happen then the block IO can't continue, therefore fencing needs to
obey anti memory deadlock rules. Simple.

>>>It's not as much of a problem if you're not trying to run / on a CFS,
>>>though, or if at least you have local swap (to which, in theory, you
>>>could swap out writes ultimately destined for the CFS).
>>>And, of course, if one node deadlocks on memory inversion, the others
>>>are going to fence it from the outside.
>>
>>As with pregnancy, there is no such thing as a little bit deadlocked.
>>You can hit the memory deadlock even on a non-root partition. All you
>>have to do is write to it, swapping will trigger it more easily but
>>writing can still trigger it.
>
> Ah, you didn't read what I wrote. Read again. When we can free up memory
> for our "dirty" user-space stack by pushing out writes destined to the
> CFS to the local swap, so we can fence etc and all that. That's an idea
> I've been toying around with lately.
>
> Of course, when we're out of virtual memory too, that won't work
> either.

I was going to mention that.

> But, I remain convinced that what we need is a general solution for the
> user-space side, because we're so badly dependent on it already, instead
> of trying to hold together the dam inside the kernel ;-)

Your solution needs to be obviously non-deadlocking. It is considerably
easier to do that analysis in kernel space. It is _possible_ to do it
in user space but in the end what do you get? A more fragile, complex
solution with the speedy interfaces on the wrong side of the kernel
boundary and more kernel glue than you saved by taking the critical bits
out of the kernel. But suit yourself, art is in the eye of the beholder.

Note that I do plan a nice, tight little interface to userspace methods,
which I promise will be able to interface with little effort to the work
you described above. I am pretty sure your code will get smaller too, or
it would if you didn't need to keep the old cruft around to support
prehistoric kernels. So if the kernel gets smaller and userspace gets
smaller and it all gets faster and more obviously correct, what is the
problem again?

Regards,

Daniel
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-06-01 0:57 ` Daniel Phillips
@ 2006-06-01 10:13 ` Lars Marowsky-Bree
2006-06-01 18:57 ` Daniel Phillips
0 siblings, 1 reply; 11+ messages in thread
From: Lars Marowsky-Bree @ 2006-06-01 10:13 UTC (permalink / raw)
To: ocfs2-devel

On 2006-05-31T17:57:44, Daniel Phillips <phillips@google.com> wrote:

> * Heartbeat
> * Membership and node status events
> * Node addressing and messaging required for the above
>
> and for completeness, the two others that need to be in kernel but aren't
> involved in the decision to fence are:
>
> * Fencing
> * Service takeover for essential services (including DLM recovery)
>
> Deciding whether a node needs to be fenced is pretty easy. If it missed X
> heartbeats, fence it.

That is too simplistic.

For example, _you_ think it missed the heartbeats. What do the other
nodes think? Did they hear it? Maybe it is you who should be fenced
instead? You need to arrive at a consensus on what the cluster should
look like, and (heuristically) compute the largest fully connected set.

And before fencing, you need to ensure you (still) have the required
quorum.

And, what if you, locally, can't talk to the fencing device necessary to
reset said node and the request has to be forwarded to another node?

To be sure, this can be done within the kernel, but it rains over the
parade as "it's so easy". ;-)

(The Consensus Cluster Membership layer alone in heartbeat is roughly
25kLoC (including comments and stuff, though.) That's not to say it
couldn't be cleaned up and done in 15kLoC, but it's definitely more than
a couple of hundred. In particular if you then have to deal with mixed
versions and so on.)

> I remember it well. I also kept my mouth shut as far as possible while
> flights of fancy involving virtual filesystems and other metamagical
> wondery floated around in the crack pipe induced haze. To be sure, there
> were some sensible suggestions put forth also, but that wasn't one of them.

Ah. Right.
That must have been the reason.

> >Fencing needs to talk to all sorts of nasty devices, something really
> >well handled by user-space and some expect scripts.
>
> If fencing itself can deadlock, it does not meet my definition of
> "well handled". Now suppose you have a device that really does
> require more code than anyone in good conscience would write in a
> kernel module, or that device absolutely must be fenced by a
> combination of bash and perl for some unfathomable reason. Then
> (copying from an earlier post) your options are:
>
> 1) A kernel fencing method sends messages to a dedicated fencing
> node that does not mount the filesystem. This may waste a node and
> needs some additional mechanism to avoid becoming a single point of
> failure.
>
> 2) A kernel fencing method sends messages to a userspace program
> written in C, memlocked, and running in (slight kernel hack here)
> PF_MEMALLOC mode. This might require a little more work than a Perl
> script, but then real men enjoy work.

But, our (as far as we can tell, appropriately protected) user-space
membership and fencing layer already does that. Why would you want to
move it back to the kernel?

> 3) A kernel fencing method sends messages to a userspace program running
> in a resource sandbox (e.g. UML or XEN) that does whatever it wants to.
> This is really buzzword compatible, really wasteful, and a great use of
> administration time.

Heh, yeah, people have brought this up to isolate things like iSCSI
initiators and servers etc. Really quite painful.

OTOH, the idea of compartmentalizing processes from each other, down to
the kernel level, does have some merit (in particular in this context).
But, if you go to that length, you can already encapsulate the
user-space clustering layer there, and don't need to move it into the
kernel either. ;-)

> How much policy is involved in "missed X heartbeats => fence it"? I see
> exactly two policy inflection points: the value of X and the period of a
> heartbeat.

See above.

> I will buy your argument if the required kernel code is more than a few
> hundred lines, otherwise I don't agree, it's more important to be obviously
> correct.

I don't know how to say that, but the last time I found you were
obviously correct has been a while ;-)

>
> >However, the more processes we gain digging into PF_MEMALLOC means
> >this reserve becomes more precious too, and they can interfere with
> >each other, if they all need their memory at the same time.
> Fortunately I anticipated that. You will see in the anti memory
> deadlock patches I posted this summer there is a mechanism for
> resizing the PF_MEMALLOC reserve as necessary, suitable for use at
> module init and exit time.

Have they been merged yet? If not, why? (I'm not asking to be annoying,
but because LKML etc is a big place and I missed the discussion. And,
"this summer" has been mostly winter so far in Germany, so I'm not sure
which summer you're referring to ;-)

If we can resize the PF_MEMALLOC space and have authorized user-space
dig into it, did you not just cause the need for this to be implemented
in kernel to go away?

> >Another key point to keep in mind is that, looking at the larger
> >picture, too many things _already_ have been moved out of the kernel
> >(and are unlikely to go back) to make it feasible to solve the problem
> >only there.
> I think I know what you're talking about, you think that there will be no
> nice interface from the kernel cluster components to userspace.

No, actually I'm saying that we've already got so many critical pieces
outside kernel-space that we really need a solution - ie, giving them
access to PF_MEMALLOC, prioritizing their network communication above
others, and if the PF_MEMALLOC reserve (or some other resource) becomes
a point of contention among this privileged group, arbitrate as sanely as
possible.

> I am not going to lose a whole lot of sleep over MPIO wankery. If you need
> it, plug it in.
> Maybe one day we will get around to re-engineering it so
> it works properly.

But, if you want to work really hard and be appreciated, you should put
more work towards helping solve the user-space problem, because then
your solution will be general.

(Not just by OCFS2, but by iSCSI, GFS, MPIO, (dare I say it:) FUSE ...)

> >Ah, you didn't read what I wrote. Read again. When we can free up memory
> >for our "dirty" user-space stack by pushing out writes destined to the
> >CFS to the local swap, so we can fence etc and all that. That's an idea
> >I've been toying around with lately.
> >
> >Of course, when we're out of virtual memory too, that won't work
> >either.
>
> I was going to mention that.

If you're completely out of virtual memory, dude, better take that
suicide pill. ;-)

And, as you mention above, the critical bits get access to the
PF_MEMALLOC reserve.

> Your solution needs to be obviously non-deadlocking. It is considerably
> easier to do that analysis in kernel space. It is _possible_ to do it
> in user space but in the end what do you get? A more fragile, complex
> solution with the speedy interfaces on the wrong side of the kernel
> boundary and more kernel glue than you saved by taking the critical bits
> out of the kernel. But suit yourself, art is in the eye of the beholder.

Well, there is the political question about how you're going to get this
actually _into_ the kernel when people seem to be convinced it can be
handled in user-space. Many a solution has hit the wall at this stage,
and I'd recommend at least giving this argument some thought, because it
will come up.

> Note that I do plan a nice, tight little interface to userspace methods,
> which I promise will be able to interface with little effort to the work
> you described above. I am pretty sure your code will get smaller too, or
> it would if you didn't need to keep the old cruft around to support
> prehistoric kernels.
> So if the kernel gets smaller and userspace gets
> smaller and it all gets faster and more obviously correct, what is the
> problem again?

Well, if that is indeed the combined result, we shall certainly all be
delighted. What's your timeline for the first useable pilot?

Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
-- Charles Darwin
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-06-01 10:13 ` Lars Marowsky-Bree
@ 2006-06-01 18:57 ` Daniel Phillips
0 siblings, 0 replies; 11+ messages in thread
From: Daniel Phillips @ 2006-06-01 18:57 UTC (permalink / raw)
To: ocfs2-devel

Lars Marowsky-Bree wrote:
> On 2006-05-31T17:57:44, I wrote:
>>Deciding whether a node needs to be fenced is pretty easy. If it missed X
>>heartbeats, fence it.
>
> That is too simplistic.
>
> For example, _you_ think it missed the heartbeats. What do the other
> nodes think? Did they hear it? Maybe it is you who should be fenced
> instead? You need to arrive at a consensus on what the cluster should
> look like, and (heuristically) compute the largest fully connected set.

The node doing the fencing has quorum by definition. None of the nodes in
the quorum have missed X heartbeats by definition. So our node is perfectly
within its rights to fence any node that has missed X heartbeats. Duh.

> And before fencing, you need to ensure you (still) have the required
> quorum.

See, you are not far off grokking all this, you probably just need another
cup of coffee.

> And, what if you, locally, can't talk to the fencing device necessary to
> reset said node and the request has to be forwarded to another node?

Asymmetric fencing capability is covered by my earlier RFC, you will need
different methods on nodes with and without fencing capability. Setting
this up is handled entirely by user space. There is no special kernel
support for this because none is needed.

> To be sure, this can be done within the kernel, but it rains over the
> parade as "it's so easy". ;-)

Good thing I brought my simplicity umbrella with me today.

> (The Consensus Cluster Membership layer alone in heartbeat is roughly
> 25kLoC (including comments and stuff, though.) That's not to say it
> couldn't be cleaned up and done in 15kLoC, but it's definitely more than
> a couple of hundred. In particular if you then have to deal with mixed
> versions and so on.)

I hold to my prediction of a 3 digit number of lines of kernel code needed
for the cluster stack, including basic methods to emulate OCFS2's current
behavior. Interesting to hear I will compete with a solution an order of
magnitude more bloated, and I bet, harder to audit and less performant.

Since I still intend to convince you of the yummy goodness of driving your
code via core events that come from kernel, much of your 15K lines of code
will still be useful with this arrangement. I do not doubt that you have a
much longer requirements list than the kernel code does and I agree that
the bulk of the code must be in user space.

> But, our (as far as we can tell, appropriately protected) user-space
> membership and fencing layer already does that. Why would you want to
> move it back to the kernel?

It is already in kernel, I am not moving it anywhere, I am just evolving it
in a natural direction.

>>...You will see in the anti memory
>>deadlock patches I posted [last] summer there is a mechanism for
>>resizing the PF_MEMALLOC reserve as necessary, suitable for use at
>>module init and exit time.
>
> Have they been merged yet? If not, why? (I'm not asking to be annoying,
> but because LKML etc is a big place and I missed the discussion. And,
> "this summer" has been mostly winter so far in Germany, so I'm not sure
> which summer you're referring to ;-)

Excuse me, I meant last summer, posted and discussed on netdev. Notice how
that topic dropped from issue number one at last year's KS to not even on
the agenda this year. That is because the interesting bit of the problem is
done, though somebody (me) still has to get busy and prepare the patch for
merging. This will arrive along with my NBD server rewrite, which lets me
set up a nice test case for it.

> If we can resize the PF_MEMALLOC space and have authorized user-space
> dig into it, did you not just cause the need for this to be implemented
> in kernel to go away?

One of the needs, yes.
Other needs include keeping the code small and easy to audit, and keeping
short lines of communication to the kernel cluster filesystem, which is by
far the heaviest user.

>>I think I know what you're talking about, you think that there will be no
>>nice interface from the kernel cluster components to userspace.
>
> No, actually I'm saying that we've already got so many critical pieces
> outside kernel-space

So far you only mentioned MPIO, an out of kernel patch that still seems far
from sanity.

> ...we really need a solution - ie, giving them
> access to PF_MEMALLOC, prioritizing their network communication above
> others, and if the PF_MEMALLOC reserve (or some other resource) becomes
> a point of contention among this privileged group, arbitrate as sanely as
> possible.

I agree, you need a solution. The main bit you don't have right now is a
syscall or ioctl for setting/clearing PF_MEMALLOC. How about we float a
really awful patch for that on lkml and wait for the hive mind to come up
with something palatable?

> But, if you want to work really hard and be appreciated, you should put
> more work towards helping solve the user-space problem, because then
> your solution will be general.
>
> (Not just by OCFS2, but by iSCSI, GFS, MPIO, (dare I say it:) FUSE ...)

I need to stay focussed on my own work at the moment, that includes fixing
the in-kernel network deadlock, fixing NBD to support local use of exports,
finishing up my cluster block devices, and patching OCFS2's cluster stack
to be general enough to support those cluster block devices. That means I
won't be taking side trips into FUSE and MPIO for a while. You are welcome
to of course, and I will code a (really lousy) userspace interface for
PF_MEMALLOC for you if you like.

>>>Ah, you didn't read what I wrote. Read again. When we can free up memory
>>>for our "dirty" user-space stack by pushing out writes destined to the
>>>CFS to the local swap, so we can fence etc and all that.
>>>That's an idea I've been toying around with lately.
>>>
>>>Of course, when we're out of virtual memory too, that won't work
>>>either.
>>
>>I was going to mention that.
>
> If you're completely out of virtual memory, dude, better take that
> suicide pill. ;-)

Sorry, you got that wrong once again. You can fill up all of swap just
doing normal writes to a file. Try to find a mechanism in the VMM that
prevents that, you won't, because that is how Linux works. By introducing
swap you don't change the fundamental problem at all, just push it around a
little.

> Well, there is the political question about how you're going to get this
> actually _into_ the kernel when people seem to be convinced it can be
> handled in user-space.

Those people have probably not even thought about the deadlock problem, not
to mention other valid reasons. Anyway, like I keep saying, it is _already_
in kernel, I am just improving it, and shrinking it too I expect.

> Many a solution has hit the wall at this stage,
> and I'd recommend at least giving this argument some thought, because it
> will come up.

It has come up. The protests are getting weaker. Those protesting generally
are not writing a lot of code, which doesn't help their protests much.

>>Note that I do plan a nice, tight little interface to userspace methods,
>>which I promise will be able to interface with little effort to the work
>>you described above. I am pretty sure your code will get smaller too, or
>>it would if you didn't need to keep the old cruft around to support
>>prehistoric kernels. So if the kernel gets smaller and userspace gets
>>smaller and it all gets faster and more obviously correct, what is the
>>problem again?
>
> Well, if that is indeed the combined result, we shall certainly all be
> delighted. What's your timeline for the first useable pilot?

"When it is ready". The next item on the agenda is an RFC for a pluggable
service master takeover harness along similar lines to the fencing harness.
I need this in order to integrate my block devices and I think it will also
be a pretty big improvement over OCFS2's existing dlm_pick_recovery_master,
which is pretty scary, as was discussed yesterday. I will offer a patch for
those two to kick around before moving on to membership events.

Regards,

Daniel
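
The quorum argument in the message above ("the node doing the fencing has
quorum by definition") can be illustrated with a short Python sketch. This
is not code from the thread. The function names are hypothetical, and the
quorum rule used here (a strict majority of configured members still
heartbeating) is an assumption, since the thread does not say how quorum is
computed.

```python
def live_nodes(members, missed, max_missed):
    """Members that have missed fewer than `max_missed` heartbeats."""
    return {n for n in members if missed.get(n, 0) < max_missed}


def has_quorum(live, members):
    """ASSUMPTION: quorum means a strict majority of configured members."""
    return len(live) > len(members) / 2


def fence_candidates(members, missed, max_missed):
    """Nodes this node may fence: only if the live set holds quorum,
    and then exactly those members that missed `max_missed` heartbeats."""
    live = live_nodes(members, missed, max_missed)
    if not has_quorum(live, members):
        return set()          # without quorum, fence nobody
    return set(members) - live


if __name__ == "__main__":
    members = {"a", "b", "c", "d", "e"}
    missed = {"a": 0, "b": 0, "c": 0, "d": 5, "e": 1}
    print(fence_candidates(members, missed, max_missed=3))
```

The sketch covers only the local decision. It does not address the points
raised in reply, such as nodes disagreeing about who was heard or a node
that cannot reach the fencing device itself.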
* [Ocfs2-devel] [RFC] Fencing harness for OCFS2
2006-05-30 19:38 ` Lars Marowsky-Bree
2006-05-30 19:42 ` Jeff Mahoney
@ 2006-05-30 19:45 ` Daniel Phillips
1 sibling, 0 replies; 11+ messages in thread
From: Daniel Phillips @ 2006-05-30 19:45 UTC (permalink / raw)
To: ocfs2-devel

Lars Marowsky-Bree wrote:
> On 2006-05-25T20:31:33, Daniel Phillips <phillips@google.com> wrote:
>
>>Goals:
>> - Lightweight, kernel based fencing harness
>> - Support pluggable fencing methods
>> - Pluggable methods take policy out of kernel
>> - No reinvented wheels, use kernel modules
>> - Also accommodate user space fencing methods
>> - Divide work appropriately between kernel and user space
>> - Obey memory deadlock prevention rules
>> - Obey safe module unload rules
>> - Handle multiple clusters per node
>
> Sorry we're chiming in so late, but with Jeff's user-space membership
> patches, we have user-space driven fencing working with heartbeat 2.

Which patches? For OCFS2? If so, then what have you done to avoid memory
deadlock?

Regards,

Daniel