From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S265716AbUGHBKY (ORCPT ); Wed, 7 Jul 2004 21:10:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S265724AbUGHBKY (ORCPT ); Wed, 7 Jul 2004 21:10:24 -0400 Received: from mx1.redhat.com ([66.187.233.31]:62428 "EHLO mx1.redhat.com") by vger.kernel.org with ESMTP id S265716AbUGHBHN (ORCPT ); Wed, 7 Jul 2004 21:07:13 -0400 From: Daniel Phillips Organization: Red Hat To: Lars Marowsky-Bree Subject: Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 Date: Wed, 7 Jul 2004 21:14:07 -0400 User-Agent: KMail/1.6.2 Cc: linux-kernel@vger.kernel.org, Lon Hohberger , David Teigland References: <200407050209.29268.phillips@redhat.com> <200407061734.51023.phillips@redhat.com> <20040707181650.GD12255@marowsky-bree.de> In-Reply-To: <20040707181650.GD12255@marowsky-bree.de> MIME-Version: 1.0 Content-Disposition: inline Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <200407072114.07291.phillips@redhat.com> Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Wednesday 07 July 2004 14:16, Lars Marowsky-Bree wrote: > On 2004-07-06T17:34:51, Daniel Phillips said: > > > And the "industry" was very reluctant > > > too. Which meant that everybody spend ages talking and not much > > > happening. > > > > We're showing up with loads of Sistina code this time. It's up to > > everybody else to ante up, and yes, I see there's more code out > > there. It's going to be quite a summer reading project. > > Yeah, I wish you the best. There's always been quite a bit of code to > show, but that alone didn't convince people ;-) I've certainly grown > a bit more experienced / cynical during that time. (Which, according > to Oscar Wilde, is the same anyway ;) OK, what I've learned from the discussion so far is, we need to avoid getting stuck too much on the HA aspects and focus more on the cluster/performance side for now. There are just too many entrenched positions on failover. Even though every component of the cluster is designed to fail over, that's just a small part of what we have to deal with: - Cluster Volume management - Cluster configuration management - Cluster membership/quorum - Node Fencing - Parallel cluster filesystems with local semantics - Distributed Locking - Cluster mirror block device - Cluster snapshot block device - Cluster administration interface, including volume managment - Cluster resource balancing - bits I forgot to mention Out of that, we need to pick the three or four items we're prepared to address immediately, that we can obviously share between at least two known cluster filesystems, and get them onto lkml for peer review. Trying to push the whole thing as one lump has never worked for anybody, and won't work in this case either. For example, the DLM is fairly non-controversial, and important in terms of performance and reliability. Let's start with that. Furthermore, nobody seems interested in arguing about the cluster block devices either, so lets just discuss how they work and get them out of the way. Then let's tackle the low level infrastructure, such as CCS (Cluster Configuration System) that does a simple job, that is, it distributes configuration files racelessly. I heard plenty of fascinating discussion of quorum strategies last night, and have a number of papers to read as a result. But that's a diversion: it can and must be pluggable. We just need to agree on how the plugs work, a considerably less ambitious task. In general, the principle is: the less important it is, the more argument there will be about it. Defer that, make it pluggable, call it policy, push it to user space, and move on. We need to agree on the basics so that we can manage network volumes with cluster filesystems on top of them. > > I can believe it. What I have just done with my cluster snapshot > > target over the last couple of weeks is, removed _every_ dependency > > on cluster infrastructure and moved the one remaining essential > > interface to user space. > > Is there a KS presentation on this? I didn't get invited to KS and > will just be allowed in for OLS, but I'll be around town already... There will be a BOF at OLS, "Cluster Infrastructure". Since I didn't get a KS invite either and what remains is more properly lkml stuff anyway, I will go canoing with Matt O'Keefe during KS as planned. We already did the necessary VFS fixups over the last year (save the non-critical flock patch, which is now in play) so there is nothing much left to beg Linus for. There are additional VFS hooks that would be nice to have for optimization, but they can wait, people will appreciate them more that way ;) The non-vfs cluster infrastructure just uses the normal module API, except for a couple of places in the DM cluster block devices where I've allowed myself some creative license, easily undone. Again, this is lkml material, not KS stuff. > > It looks like fencing is more of an issue, because having several > > node fencing systems running at the same time in ignorance of each > > other is deeply wrong. We can't just wave our hands at this by > > making it pluggable, we need to settle on one that works and use > > it. I'll humbly suggest that Sistina is furthest along in this > > regard. > > Your fencing system is fine with me; based on the assumption that you > always have to fence a failed node, you are doing the right thing. > However, the issues are more subtle when this is no longer true, and > in a 1:1 how do you arbitate who is allowed to fence? Good question. Since two-node clusters are my primary interest at the moment, I need some answers. I think the current plan is: they try to fence each other, winner take all. Each node will introspect to decide if it's in good enough shape to do the job itself, then go try to fence the other one. Alternatively, they can be configured so that one has more votes than the other, if somebody wants that broken arrangement. This is my dim recollection, I'll have more to say when I've actually hooked my stuff up to it. There are others with plenty of experience in this, see below. > > Cluster resource management is the least advanced of the components > > that our Red Hat Sistina group has to offer, mainly because it is > > seen as a matter of policy, and so the pressing need at this state > > is to provide suitable hooks. > > > > "STOMITH" :) Yes, exactly. Global load balancing is another big > > item, i.e., which node gets assigned the job of running a > > particular service, which means you need to know how much of each > > of several different kinds of resources a particular service > > requires, and what the current resource usage profile is for each > > node on the cluster. Rik van Riel is taking a run at this. > > Right, cluster resource management is one of the things where I'm > quite happy with the approach the new heartbeat resource manager is > heading down (or up, I hope ;). Combining heartbeat and resource management sounds like a good idea. Currently, we have them separate and since I have not tried it myself yet, I'll reserve comment. Dave Teigland would be more than happy to wax poetic, though. > > It's a huge, scary problem. We _must_ be able to plug in different > > solutions, all the way from completely manual to completely > > automagic, and we have to be able to handle more than one at once. > > You can plug multiple ones as long as they are managing independent > resources, obviously. However, if the CRM is the one which ultimately > decides whether a node needs to be fenced or not - based on its > knowledge of which resources it owns or could own - this gets a lot > more scary still... We do not see the CRM as being involved in fencing at present, though I can see why perhaps it ought to be. The resource manager that Lon Hohberger is cooking up is scriptable and rule-driven. I'm sure we could spend 100% of the available time on that alone. My strategy is, I send my manually-configurable cluster bits to Lon and he hooks them in so everything is automagic, then I look at how much the end result sucks/doesn't suck. There's some philosophy at work here: I feel that any cluster device that requires elaborate infrastructure and configuration to run is broken. If you can set the cluster devices up manually and they depend only on existing kernel interfaces, they're more likely to get unit testing. At the same time, these devices have to fit well into a complex infrastructure, therefore the manual interface can be driven equally well by a script or C program, and there is one tiny but crucial additional hook to allow for automatic reconnection to the cluster if something bad happens, or if the resource manager just feels the need to reorganize things. So while I'm rambling here, I'll mention that the resource manager (or anybody else) can just summarily cut the block target's pipe and the block target will politely go ask for a new one. No IOs will be failed, nothing will break, no suspend needed, just one big breaker switch to throw. This of course depends on the target using a pipe (socket) to communicate with the cluster, but even if I do switch to UDP, I'll still keep at least one pipe around, just because it makes the target so easy to control. It didn't start this way. The first prototype had a couple thousand lines of glue code to work with various possible infrastructures. Now that's all gone and there are just two pipes left, one to local user space for cluster management and the other to somewhere out on the cluster for synchronization. It's now down to 30% of the original size and runs faster as a bonus. All cluster interfaces are "read/write", except for one ioctl to reconnect a broken pipe. > > Incidently, there is already a nice crosssection of the cluster > > community on the way to sunny Minneapolis for the July meeting. > > We've reached about 50% capacity, and we have quorum, I think :-) > > Uhm, do I have to be frightened of being fenced? ;) Only if you drink too much of that kluster Koolaid Regards, Daniel