From mboxrd@z Thu Jan 1 00:00:00 1970
From: Daniel Phillips
Date: Wed, 31 May 2006 17:57:44 -0700
Subject: [Ocfs2-devel] [RFC] Fencing harness for OCFS2
In-Reply-To: <20060531230302.GB4265@marowsky-bree.de>
References: <44767695.7000203@google.com> <20060530193848.GP17040@marowsky-bree.de> <447CA02C.30308@suse.com> <447CA7D7.4090803@google.com> <20060531001925.GQ17040@marowsky-bree.de> <447DE28F.9090308@google.com> <20060531230302.GB4265@marowsky-bree.de>
Message-ID: <447E3B88.1010108@google.com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: ocfs2-devel@oss.oracle.com

Lars Marowsky-Bree wrote:
> I meant to say that having the fencing alone in the kernel isn't
> sufficient; you also need to have the processes which ultimately
> conclude that fencing of node(s) $foo is required protected.

Yes. From my earlier post, those processes are:

   * Heartbeat
   * Membership and node status events
   * Node addressing and messaging required for the above

and for completeness, the two others that need to be in kernel but
aren't involved in the decision to fence are:

   * Fencing
   * Service takeover for essential services (including DLM recovery)

Deciding whether a node needs to be fenced is pretty easy. If it missed
X heartbeats, fence it.

>>Who is moving away from? OCFS2 team guys have toyed with the idea but
>>hopefully understand why it's a bad idea by now. If not then maybe I
>>need to adjust the violence setting on my larting tool.
>
> It's what's been discussed at all the events I've went to in the last
> couple of years, one or two even organized by you, as I recall ;-)

I remember it well. I also kept my mouth shut as far as possible while
flights of fancy involving virtual filesystems and other metamagical
wondery floated around in the crack pipe induced haze. To be sure, there
were some sensible suggestions put forth also, but that wasn't one of
them.
> Fencing needs to talk to all sorts of nasty devices, something really
> well handled by user-space and some expect scripts.

If fencing itself can deadlock, it does not meet my definition of "well
handled".

Now suppose you have a device that really does require more code than
anyone in good conscience would write in a kernel module, or that device
absolutely must be fenced by a combination of bash and perl for some
unfathomable reason. Then (copying from an earlier post) your options
are:

   1) A kernel fencing method sends messages to a dedicated fencing node
      that does not mount the filesystem. This may waste a node and
      needs some additional mechanism to avoid becoming a single point
      of failure.

   2) A kernel fencing method sends messages to a userspace program
      written in C, memlocked, and running in (slight kernel hack here)
      PF_MEMALLOC mode. This might require a little more work than a
      Perl script, but then real men enjoy work.

   3) A kernel fencing method sends messages to a userspace program
      running in a resource sandbox (e.g. UML or XEN) that does whatever
      it wants to. This is really buzzword compatible, really wasteful,
      and a great use of administration time.

And using the fencing harness I described we can write one generic
kernel module that is parameterized so that it can invoke any possible
userspace fencing method. See how hard I work for you? Sometimes it
seems this effort just goes unappreciated.

>>Anyway, I erred in not mentioning above that only a small core of each
>>service has to stay in kernel or otherwise implement memory deadlock
>>avoidance measures. Most of the bulky code can indeed go to userspace
>>without any special measures, just not anything that has to execute
>>in the block write path.
>
> That much is true, but aren't the things which provide (policy) input to
> the block write path also crucial? What does it help you that, in
> theory, you could fence, but never reach that conclusion?
How much policy is involved in "missed X heartbeats => fence it"? I see
exactly two policy inflection points: the value of X and the period of a
heartbeat. These are both to be supplied as parameters from user space
at node bringup time.

To be fair I haven't described the heartbeat harness yet. But it will
look a lot like the fencing harness, including having pluggable
heartbeat methods, so you can easily plug in your userspace heartbeat
instead of kernel heartbeat if you want to.

>>So yes, I intend to place all of the above in kernel space. OCFS2 has
>>them there now anyway, though not in a sufficiently general form. I
>>promise that by the time I am done OCFS2's kernel cluster stack code
>>will be significantly smaller and more general at the same time, and
>>you will be able to do all the fancy things you're doing now, except
>>drive those essential services from user space.
>
> The problem is your intent is to keep membership and heartbeating and
> quorum computation within the kernel. I don't think that's a good idea.
> They can be readily performed in user-space, and even quite protected
> against memory deadlocks. Same for fencing. Except for the bits which
> require in-kernel memory from user-space, but as you say, they can be
> protected with PF_MEMALLOC.

I will buy your argument if the required kernel code is more than a few
hundred lines, otherwise I don't agree, it's more important to be
obviously correct.

> However, the more processes we gain digging into PF_MEMALLOC means this
> reserve becomes more precious too, and they can interfere with
> eachother, if they all need their memory at the same time.

Fortunately I anticipated that. You will see in the anti memory deadlock
patches I posted this summer there is a mechanism for resizing the
PF_MEMALLOC reserve as necessary, suitable for use at module init and
exit time.

> (I thought the idea of per-process-mempools has been discussed in the
> past, but was met with lots of "You do it!" remarks.)
I considered using mempools for this but I didn't like the way the code
got ugly fast. Using the PF_MEMALLOC reserve has the nice feature that
it just works for userspace, provided the userspace code is properly
throttled of course.

> Another key point to keep in mind is that, looking at the larger
> picture, too many things _already_ have been moved out of the kernel
> (and are unlikely to go back) to make it feasible to solve the problem
> only there.

I think I know what you're talking about: you think that there will be
no nice interface from the kernel cluster components to userspace. Rest
assured that there will be one of those and it will be a nice one, and
simple too. Furthermore, because you have that pluggable harness, you
can just invent your own and plug it in.

> Think MPIO, which is driven via sysfs and other things with a user-space
> daemon; if you're under memory pressure, the system won't be able to
> recover the paths to your cluster filesystem, what good does it do you
> that your CFS fencing works? How about iscsi?

I am not going to lose a whole lot of sleep over MPIO wankery. If you
need it, plug it in. Maybe one day we will get around to re-engineering
it so it works properly.

>>If you think you can solve the problem accurately with less kernel
>>code then please please show me how, but don't forget that _all_
>>fencing code that can execute in the block write path must obey the
>>memory deadlock rules.
>
> Well, the heartbeat fencing modules ain't called from the block write
> path; after we've fenced a node, we signal this to the kernel (by
> cleaning up the links to it in OCFS2's configfs directory).
>
> Now of course, if we're deadlocked, we might never get that far. Sigh.
> Life sucks.

Exactly. I never said "called from", I said "executes in the path of".
Fencing executes in the path of block IO in the sense that if the
fencing doesn't happen then the block IO can't continue, therefore
fencing needs to obey anti memory deadlock rules. Simple.
>>>It's not as much of a problem if you're not trying to run / on a CFS,
>>>though, or if at least you have local swap (to which, in theory, you
>>>could swap out writes ultimately destined for the CFS). And, of course,
>>>if one node deadlocks on memory inversion, the others are going to fence
>>>it from the outside.
>>
>>As with pregnancy, there is no such thing as a little bit deadlocked.
>>You can hit the memory deadlock even on a non-root partition. All you
>>have to do is write to it, swapping will trigger it more easily but
>>writing can still trigger it.
>
>
> Ah, you didn't read what I wrote. Read again. When we can free up memory
> for our "dirty" user-space stack by pushing out writes destined to the
> CFS to the local swap, so we can fence etc and all that. That's an idea
> I've been toying around with lately.
>
> Of course, when we're out of virtual memory too, that won't work
> either.

I was going to mention that.

> But, I remain convinced that what we need is a general solution for the
> user-space side, because we're so badly dependend on it already, instead
> of trying to hold together the dam inside the kernel ;-)

Your solution needs to be obviously non-deadlocking. It is considerably
easier to do that analysis in kernel space. It is _possible_ to do it in
user space but in the end what do you get? A more fragile, complex
solution with the speedy interfaces on the wrong side of the kernel
boundary and more kernel glue than you saved by taking the critical bits
out of the kernel. But suit yourself, art is in the eye of the beholder.

Note that I do plan a nice, tight little interface to userspace methods,
which I promise will be able to interface with little effort to the work
you described above. I am pretty sure your code will get smaller too, or
it would if you didn't need to keep the old cruft around to support
prehistoric kernels.
So if the kernel gets smaller and userspace gets smaller and it all gets
faster and more obviously correct, what is the problem again?

Regards,

Daniel
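
For reference, the fencing rule described in this message ("missed X
heartbeats => fence it", with X and the heartbeat period supplied as
parameters) can be sketched in a few lines of C. This is only an
illustration under stated assumptions, not code from OCFS2 or from the
proposed harness: the names hb_policy, hb_peer, hb_beat, hb_missed and
hb_should_fence are hypothetical, timestamps are taken to be monotonic
milliseconds, and a missed heartbeat is counted as one full period
elapsed without hearing from the peer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; not OCFS2's actual heartbeat code. */

/* The two policy parameters, assumed to be set once at node bringup. */
struct hb_policy {
    uint32_t miss_limit; /* "X": missed heartbeats that trigger fencing */
    uint64_t period_ms;  /* heartbeat period, in milliseconds */
};

/* Per-peer state: the time the last heartbeat from that node arrived. */
struct hb_peer {
    uint64_t last_seen_ms;
};

/* Record a heartbeat received from the peer at time now_ms. */
static inline void hb_beat(struct hb_peer *peer, uint64_t now_ms)
{
    peer->last_seen_ms = now_ms;
}

/* Whole heartbeat periods elapsed since the peer was last heard from. */
static inline uint64_t hb_missed(const struct hb_policy *pol,
                                 const struct hb_peer *peer,
                                 uint64_t now_ms)
{
    assert(pol->period_ms > 0);
    if (now_ms <= peer->last_seen_ms)
        return 0;
    return (now_ms - peer->last_seen_ms) / pol->period_ms;
}

/* The whole decision: fence once X or more heartbeats were missed. */
static inline bool hb_should_fence(const struct hb_policy *pol,
                                   const struct hb_peer *peer,
                                   uint64_t now_ms)
{
    return hb_missed(pol, peer, now_ms) >= pol->miss_limit;
}
```

With miss_limit = 3 and period_ms = 1000, a peer last heard from at
t = 5000 is left alone at t = 7999 and becomes a fencing candidate at
t = 8000. The check allocates nothing and touches only fixed-size state,
which is the property the memory deadlock argument above depends on.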