* [ANNOUNCE] Minneapolis Cluster Summit, July 29-30  (55+ messages in thread)
  From: Daniel Phillips @ 2004-07-05 6:09 UTC
  To: linux-kernel

Red Hat and (the former) Sistina Software are pleased to announce that we will host a two-day kickoff workshop on GFS and Cluster Infrastructure in Minneapolis, July 29 and 30, not too long after OLS. We call this the "Cluster Summit" because it goes well beyond GFS, and is really about building a comprehensive cluster infrastructure for Linux, which will hopefully be a reality by the time Linux 2.8 arrives. If we want that, we have to start now, and we have to work like fiends; time is short.

We offer, as a starting point, functional code for a half-dozen major, generic cluster subsystems that Sistina has had under development for several years. This means not just a cluster filesystem, but cluster logical volume management, generic distributed locking, cluster membership services, node fencing, user space utilities, graphical interfaces and more. Of course, it's all up for peer review.

Everybody is invited, and yes, that includes OCFS and Lustre folks too. Speaking as an honorary OpenGFS team member, we will be there in force.

Tentative agenda items:

  - GFS walkthrough: let's get hacking
  - GULM, the Grand Unified Lock Manager
  - Sistina's brand new Distributed Lock Manager
  - Symmetric Cluster Architecture walkthrough
  - Are we there yet? Infrastructure directions
  - GFS: Great, it works! What next?

Further details, including information on travel and hotel arrangements, will be posted over the next few days on the Red Hat sponsored community cluster page:

  http://sources.redhat.com/cluster/

Unfortunately, space is limited. We feel we can accommodate about fifty people comfortably. Registration is first come, first served. The price is: Free! (Of course.) If you're interested, please email me.
Let's set our sights on making Linux 2.8 a true cluster operating system.

Regards,

Daniel

^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Christoph Hellwig @ 2004-07-05 15:09 UTC
  To: Daniel Phillips; +Cc: linux-kernel

On Mon, Jul 05, 2004 at 02:09:29AM -0400, Daniel Phillips wrote:
> Red Hat and (the former) Sistina Software are pleased to announce that
> we will host a two day kickoff workshop on GFS and Cluster
> Infrastructure in Minneapolis, July 29 and 30, not too long after OLS.
> [...] If we want that, we have to start now, and we have to work like
> fiends, time is short. We offer as a starting point, functional code
> for a half-dozen major, generic cluster subsystems that Sistina has
> had under development for several years.

Don't you think it's a little too short-term? I'd rather see the cluster software that could be merged mid-term discussed at KS (and that seems to be only OCFS2 so far).
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Daniel Phillips @ 2004-07-05 18:42 UTC
  To: Christoph Hellwig; +Cc: linux-kernel

Hi Christoph,

On Monday 05 July 2004 11:09, Christoph Hellwig wrote:
> Don't you think it's a little too short-term?

Not really. It's several months later than it should have been, if anything.

> I'd rather see the cluster software that could be merged mid-term
> discussed at KS (and that seems to be only OCFS2 so far)

Don't you think we ought to take a look at how OCFS and GFS might share some of the same infrastructure, for example, the DLM and cluster membership services?

"Think twice, merge once"

Regards,

Daniel
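[Editor's note: for readers unfamiliar with what sharing "the DLM" would entail, the lock managers in this lineage (GULM, the Sistina DLM) follow the VMS-style model of six lock modes with a fixed compatibility matrix. The sketch below is textbook background, not code from any project in this thread.]

```python
# VMS-style DLM lock modes: NL = null, CR = concurrent read,
# CW = concurrent write, PR = protected read, PW = protected write,
# EX = exclusive. Background sketch only.

MODES = ["NL", "CR", "CW", "PR", "PW", "EX"]

# COMPAT[held] is the set of modes that may be granted to another
# holder while `held` is outstanding on the same resource.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}

def compatible(held, requested):
    """True if `requested` can be granted while `held` is outstanding."""
    return requested in COMPAT[held]

# Sanity checks: the matrix is symmetric, readers share, writers exclude.
assert all(compatible(a, b) == compatible(b, a) for a in MODES for b in MODES)
assert compatible("PR", "PR")        # two caching readers coexist
assert not compatible("PW", "PW")    # writers exclude each other
assert not compatible("EX", "EX")
```

The point of sharing this layer between OCFS and GFS is that lock recovery is entangled with membership: when a node dies, the DLM has to know (from the shared membership service) whose locks to reclaim.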
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Chris Friesen @ 2004-07-05 19:08 UTC
  To: Daniel Phillips; +Cc: Christoph Hellwig, linux-kernel

Daniel Phillips wrote:
> Don't you think we ought to take a look at how OCFS and GFS might share
> some of the same infrastructure, for example, the DLM and cluster
> membership services?

For cluster membership, you might consider looking at the OpenAIS CLM portion. It would be nice if this type of thing were unified across more than just filesystems.

Chris
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Daniel Phillips @ 2004-07-05 20:29 UTC
  To: Chris Friesen; +Cc: Christoph Hellwig, linux-kernel

On Monday 05 July 2004 15:08, Chris Friesen wrote:
> For cluster membership, you might consider looking at the OpenAIS CLM
> portion. It would be nice if this type of thing were unified across
> more than just filesystems.

My own project is a block driver; that's not a filesystem, right? Cluster membership services as implemented by Sistina are generic, symmetric and (hopefully) raceless. See:

http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/preslan/preslan.pdf

There is much overlap between OpenAIS and Sistina's Symmetric Cluster Architecture. You are right, we do need to get together.

By the way, how do I get your source code if I don't agree with the BitKeeper license?

Regards,

Daniel
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Steven Dake @ 2004-07-07 22:55 UTC
  To: Daniel Phillips; +Cc: Chris Friesen, Christoph Hellwig, linux-kernel

On Mon, 2004-07-05 at 13:29, Daniel Phillips wrote:
> There is much overlap between OpenAIS and Sistina's Symmetric
> Cluster Architecture. You are right, we do need to get together.
>
> By the way, how do I get your source code if I don't agree with the
> BitKeeper license?

If you mean how do you get source code to the openais project without bk, it is available as a nightly tarball download from developer.osdl.org:

http://developer.osdl.org/cherry/openais

If you want to contribute to openais, you can still contribute by sending diff-generated patches to:

openais@lists.osdl.org

Regards
-steve
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Daniel Phillips @ 2004-07-08 1:30 UTC
  To: sdake; +Cc: Chris Friesen, Christoph Hellwig, linux-kernel

On Wednesday 07 July 2004 18:55, Steven Dake wrote:
> On Mon, 2004-07-05 at 13:29, Daniel Phillips wrote:
> > Cluster membership services as implemented by Sistina are generic,
> > symmetric and (hopefully) raceless. See:
> >
> > http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/preslan/preslan.pdf

Whoops, I just noticed that that link is way wrong, I must have been asleep when I posted it. This is the correct one:

http://people.redhat.com/~teigland/sca.pdf

and

http://sources.redhat.com/cluster/cman/

Not that the other isn't interesting, it's just a little dated and GFS-specific.

Regards,

Daniel
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-05 18:42 ` Daniel Phillips 2004-07-05 19:08 ` Chris Friesen @ 2004-07-05 19:12 ` Lars Marowsky-Bree 2004-07-05 20:27 ` Daniel Phillips 1 sibling, 1 reply; 55+ messages in thread From: Lars Marowsky-Bree @ 2004-07-05 19:12 UTC (permalink / raw) To: Daniel Phillips, Christoph Hellwig; +Cc: linux-kernel On 2004-07-05T14:42:27, Daniel Phillips <phillips@redhat.com> said: > Don't you think we ought to take a look at how OCFS and GFS might share > some of the same infrastructure, for example, the DLM and cluster > membership services? Indeed. If your efforts in joining the infrastructure are more successful than ours have been, more power to you ;-) Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Daniel Phillips @ 2004-07-05 20:27 UTC
  To: Lars Marowsky-Bree; +Cc: Christoph Hellwig, linux-kernel

Hi Lars,

On Monday 05 July 2004 15:12, Lars Marowsky-Bree wrote:
> Indeed. If your efforts in joining the infrastructure are more
> successful than ours have been, more power to you ;-)

What problems did you run into? On a quick read-through, it seems quite straightforward for quorum, membership and distributed locking. The idea of having more than one node fencing system running at the same time seems deeply scary; we'd better make some effort to come up with something common.

Regards,

Daniel
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Lars Marowsky-Bree @ 2004-07-06 7:34 UTC
  To: Daniel Phillips; +Cc: linux-kernel

On 2004-07-05T16:27:51, Daniel Phillips <phillips@redhat.com> said:

> > Indeed. If your efforts in joining the infrastructure are more
> > successful than ours have been, more power to you ;-)
>
> What problems did you run into?

The problems were mostly political. Maybe we tried to push too early, but 1-3 years back, people weren't really interested in agreeing on some common components or APIs. In particular, a certain Linux vendor didn't even join the group ;-) And the "industry" was very reluctant too. Which meant that everybody spent ages talking and not much happening.

However, times may have changed, and hopefully for the better. The push to get one solution included into the Linux kernel may be enough to convince people that this time it's for real...

There still is the Open Clustering Framework group though, which is a sub-group of the FSG and maybe the right umbrella to put this under, to stay away from the impression that it's a single vendor pushing. If we could revive that and make real progress, I'd be as happy as a well-fed penguin.

Now with OpenAIS on the table, the GFS stack, the work already done by OCF in the past (which is, admittedly, depressingly little, but I quite like the Resource Agent API for one) et cetera, there may be a good chance.

I'll try to get travel approval to go to the meeting.

BTW, is the mailing list working? I tried subscribing when you first announced it, but the subscription request hasn't been approved yet... Maybe I shouldn't have subscribed with the suse.de address ;-)

> On a quick read-through, it seems quite straightforward for quorum,
> membership and distributed locking.
Believe me, you'd be amazed to find out how long you can argue on how to identify a node alone - node name, node number (sparse or continuous?), UUID...? ;-)

And, how do you define quorum, and is it always needed? Some algorithms don't need quorum (i.e., election algorithms can do fine without), so a membership service which only works with quorum isn't the right component, etc...

> The idea of having more than one node fencing system running at the same
> time seems deeply scary, we'd better make some effort to come up with
> something common.

Yes. This is actually an important point, and fencing policies are also reasonably complex. The GFS stack seems to tie fencing quite deeply into the system (which is understandable, since you always have shared storage; otherwise a node wouldn't be part of the GFS domain in the first place).

However, the new dependency-based cluster resource manager we are writing right now (which we simply call "Cluster Resource Manager" for lack of creativity ;) decides whether or not it needs to fence a node based on the resources in the cluster - if it isn't affecting the resources we can run on the remaining nodes, or none of the resources requires node-level fencing, no such operation will be done.

This has advantages in larger clusters (where, if split, each partition could still continue to run resources which are unaffected by the split even if the other nodes cannot be fenced), in shared-nothing clusters, or with resources which are self-fencing and do not need STONITH, etc.

The ties between membership, quorum and fencing are not as strong in these scenarios, at least not mandatory. So a stack which enforced fencing at these levels, and without coordinating with the CRM first, would not work out.

And by pushing for inclusion into the main kernel, you'll also raise all sleeping zom^Wbeauties. I hope you have a long breath for the discussions ;-) There's lots of work there.
Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
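[Editor's note: to make Lars's "how do you define quorum" question concrete, one common definition is strict majority over configurable per-node votes. The sketch below is a simplified illustration, not the actual code of any membership service mentioned in the thread; it also shows why the symmetric two-node case keeps coming up.]

```python
# Simple majority-vote quorum with configurable per-node votes.
# Hypothetical sketch for illustration only.

def has_quorum(votes_present, expected_votes):
    """Quorate iff strictly more than half the expected votes are present."""
    return 2 * votes_present > expected_votes

# A five-node cluster (one vote each) tolerates two failures:
assert has_quorum(3, 5) and not has_quorum(2, 5)

# A symmetric two-node cluster can never be quorate after a split,
# since neither side holds a strict majority:
assert not has_quorum(1, 2)

# Giving one node an extra vote "fixes" that, at the cost of making
# the cluster asymmetric (only the two-vote node survives alone):
assert has_quorum(2, 3)      # favoured node holds 2 of 3 votes
assert not has_quorum(1, 3)  # the other node holds 1 of 3
```

This is exactly why Lars notes that some algorithms prefer to do without quorum entirely: a membership service that only reports membership when quorate bakes this one policy into the stack.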
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Daniel Phillips @ 2004-07-06 21:34 UTC
  To: Lars Marowsky-Bree; +Cc: linux-kernel, Lon Hohberger

Hi Lars,

On Tuesday 06 July 2004 03:34, Lars Marowsky-Bree wrote:
> The problems were mostly political. Maybe we tried to push too early,
> but 1-3 years back, people weren't really interested in agreeing on
> some common components or APIs. In particular, a certain Linux vendor
> didn't even join the group ;-)

*blush*

> And the "industry" was very reluctant too. Which meant that everybody
> spent ages talking and not much happening.

We're showing up with loads of Sistina code this time. It's up to everybody else to ante up, and yes, I see there's more code out there. It's going to be quite a summer reading project.

> However, times may have changed, and hopefully for the better. The
> push to get one solution included into the Linux kernel may be enough
> to convince people that this time it's for real...

It's for real, no question. There are at least two viable GPL code bases already, GFS and Lustre, with OCFS2 coming up fast. And there are several commercial (binary/evil) cluster filesystems in service already, not that Linus should care about them, but they do lend credibility.

> There still is the Open Clustering Framework group though, which is a
> sub-group of the FSG and maybe the right umbrella to put this under,
> to stay away from the impression that it's a single vendor pushing.
Oops, another code base to read ;-)

> If we could revive that and make real progress, I'd be as happy as a
> well-fed penguin.

Red Hat is solidly behind this as a _community_ effort.

> Now with OpenAIS on the table, the GFS stack, the work already done
> by OCF in the past (which is, admittedly, depressingly little, but I
> quite like the Resource Agent API for one) et cetera, there may be a
> good chance.
>
> I'll try to get travel approval to go to the meeting.

:-)

> BTW, is the mailing list working? I tried subscribing when you first
> announced it, but the subscription request hasn't been approved
> yet... Maybe I shouldn't have subscribed with the suse.de address ;-)

Perhaps it has more to do with a cross-channel grudge? <grin> Just poke Alasdair, you know where to find him.

> Believe me, you'd be amazed to find out how long you can argue on how
> to identify a node alone - node name, node number (sparse or
> continuous?), UUID...? ;-)

I can believe it. What I have just done with my cluster snapshot target over the last couple of weeks is remove _every_ dependency on cluster infrastructure and move the one remaining essential interface to user space. In this way the infrastructure becomes pluggable from the cluster block device's point of view, and you can run the target without any cluster infrastructure at all if you want (just dmsetup and a utility for connecting a socket to the target). This is a general technique that we're now applying to a second block driver. It's a tiny amount of kernel and userspace code which I will post pretty soon. With this refactoring, the cluster block driver shrank to less than half its former size with no loss of functionality. The nice thing is, I get to use the existing (SCA) infrastructure, but I don't have any dependency on it.

> And, how do you define quorum, and is it always needed? Some
> algorithms don't need quorum (i.e., election algorithms can do fine
> without), so a membership service which only works with quorum isn't
> the right component, etc...

Oddly enough, there has been much discussion about quorum here as well. This must be pluggable, and we must be able to handle multiple, independent clusters, with a single node potentially belonging to more than one at the same time. Please see this for a formal writeup on our 2.6 code base:

http://people.redhat.com/~teigland/sca.pdf

Is this the key to the grand, unified quorum system that will do every job perfectly? Good question; however, I do know how to make it pluggable for my own component, at essentially zero cost. This makes me optimistic that we can work out something sensible, and that perhaps it's already a solved problem.

It looks like fencing is more of an issue, because having several node fencing systems running at the same time in ignorance of each other is deeply wrong. We can't just wave our hands at this by making it pluggable; we need to settle on one that works and use it. I'll humbly suggest that Sistina is furthest along in this regard.

> > The idea of having more than one node fencing system running at the
> > same time seems deeply scary, we'd better make some effort to come
> > up with something common.
>
> Yes. This is actually an important point, and fencing policies are
> also reasonably complex. The GFS stack seems to tie fencing quite
> deeply into the system (which is understandable, since you always
> have shared storage, otherwise a node wouldn't be part of the GFS
> domain in the first place).

Oops, should have read ahead ;) The DLM is also tied deeply into the GFS stack, but that factors out nicely, and in fact, GFS can currently use two completely different fencing systems (GULM vs SCA-Fence). I think we can sort this out.
> However, the new dependency-based cluster resource manager we are
> writing right now (which we simply call "Cluster Resource Manager"
> for lack of creativity ;) decides whether or not it needs to fence a
> node based on the resources in the cluster - if it isn't affecting
> the resources we can run on the remaining nodes, or none of the
> resources requires node-level fencing, no such operation will be
> done.

Cluster resource management is the least advanced of the components that our Red Hat Sistina group has to offer, mainly because it is seen as a matter of policy, and so the pressing need at this stage is to provide suitable hooks. Lon Hohberger is working on a system that works with the SCA framework (Magma). The preexisting Red Hat cluster team decided to re-roll their whole cluster suite within the new framework. Perhaps you would like to take a look, and tell us why this couldn't possibly work for you? (Or maybe we need to get you drunk first...)

> This has advantages in larger clusters (where, if split, each
> partition could still continue to run resources which are unaffected
> by the split even if the other nodes cannot be fenced), in
> shared-nothing clusters, or with resources which are self-fencing and
> do not need STONITH, etc.

"STOMITH" :) Yes, exactly. Global load balancing is another big item, i.e., which node gets assigned the job of running a particular service, which means you need to know how much of each of several different kinds of resources a particular service requires, and what the current resource usage profile is for each node on the cluster. Rik van Riel is taking a run at this.

It's a huge, scary problem. We _must_ be able to plug in different solutions, all the way from completely manual to completely automagic, and we have to be able to handle more than one at once.

> The ties between membership, quorum and fencing are not as strong in
> these scenarios, at least not mandatory. So a stack which enforced
> fencing at these levels, and without coordinating with the CRM first,
> would not work out.

Yes, again, fencing looks like the one we have to fret about. The others will be a lot easier to mix and match.

> And by pushing for inclusion into the main kernel, you'll also raise
> all sleeping zom^Wbeauties. I hope you have a long breath for the
> discussions ;-)

You know I do!

> There's lots of work there.

Indeed, and I didn't do any work today yet, due to answering email.

Incidentally, there is already a nice cross-section of the cluster community on the way to sunny Minneapolis for the July meeting. We've reached about 50% capacity, and we have quorum, I think :-)

Regards,

Daniel
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Lars Marowsky-Bree @ 2004-07-07 18:16 UTC
  To: Daniel Phillips; +Cc: linux-kernel, Lon Hohberger

On 2004-07-06T17:34:51, Daniel Phillips <phillips@redhat.com> said:

> We're showing up with loads of Sistina code this time. It's up to
> everybody else to ante up, and yes, I see there's more code out there.
> It's going to be quite a summer reading project.

Yeah, I wish you the best. There's always been quite a bit of code to show, but that alone didn't convince people ;-) I've certainly grown a bit more experienced / cynical during that time. (Which, according to Oscar Wilde, is the same anyway ;)

> It's for real, no question. There are at least two viable GPL code
> bases already, GFS and Lustre, with OCFS2 coming up fast.

Yes, they have some common requirements on the kernel VFS layer, though Lustre certainly has the most extensive demands. I hope someone from CFS Inc can make it to your summit.

> I can believe it. What I have just done with my cluster snapshot
> target over the last couple of weeks is remove _every_ dependency on
> cluster infrastructure and move the one remaining essential interface
> to user space.

Is there a KS presentation on this? I didn't get invited to KS and will just be allowed in for OLS, but I'll be around town already...

> Oddly enough, there has been much discussion about quorum here as
> well. This must be pluggable, and we must be able to handle multiple,
> independent clusters, with a single node potentially belonging to
> more than one at the same time. Please see this for a formal writeup
> on our 2.6 code base:
>
> http://people.redhat.com/~teigland/sca.pdf

Thanks for the pointer, this is a good read.

> It looks like fencing is more of an issue, because having several node
> fencing systems running at the same time in ignorance of each other is
> deeply wrong. We can't just wave our hands at this by making it
> pluggable; we need to settle on one that works and use it. I'll humbly
> suggest that Sistina is furthest along in this regard.

Your fencing system is fine with me; based on the assumption that you always have to fence a failed node, you are doing the right thing. However, the issues are more subtle when this is no longer true, and in a 1:1 split, how do you arbitrate who is allowed to fence?

> Cluster resource management is the least advanced of the components
> that our Red Hat Sistina group has to offer, mainly because it is
> seen as a matter of policy, and so the pressing need at this stage is
> to provide suitable hooks.
>
> "STOMITH" :) Yes, exactly. Global load balancing is another big item,
> i.e., which node gets assigned the job of running a particular
> service, which means you need to know how much of each of several
> different kinds of resources a particular service requires, and what
> the current resource usage profile is for each node on the cluster.
> Rik van Riel is taking a run at this.

Right, cluster resource management is one of the things where I'm quite happy with the approach the new heartbeat resource manager is heading down (or up, I hope ;).

> It's a huge, scary problem. We _must_ be able to plug in different
> solutions, all the way from completely manual to completely automagic,
> and we have to be able to handle more than one at once.

You can plug multiple ones as long as they are managing independent resources, obviously.
However, if the CRM is the one which ultimately decides whether a node needs to be fenced or not - based on its knowledge of which resources it owns or could own - this gets a lot more scary still... > Yes, again, fencing looks like the one we have to fret about. The > others will be a lot easier to mix and match. Mostly, yes. Unless you (like some) require quorum to report a cluster membership, which some implementations do. > Incidently, there is already a nice crosssection of the cluster > community on the way to sunny Minneapolis for the July meeting. We've > reached about 50% capacity, and we have quorum, I think :-) Uhm, do I have to be frightened of being fenced? ;) Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
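[Editor's note: the 1:1 arbitration question Lars raises can be pictured with a toy model. In a two-node split, both survivors race to fence each other, and the fencing hardware itself (a power switch or SAN zone controller) ends up as the serialization point: whichever request lands first wins. This is purely illustrative, not code from any fencing implementation discussed in the thread.]

```python
# Toy "winner take all" fence race for a two-node cluster. The shared
# fence device handles requests in arrival order.

class FenceDevice:
    """Models a shared power switch that serializes fence requests."""

    def __init__(self):
        self.dead = set()

    def fence(self, requester, target):
        # A node that has already been fenced lost the race; its
        # in-flight request must have no effect.
        if requester in self.dead:
            return False
        self.dead.add(target)
        return True

# On a split, both nodes detect the partition and race to fence:
switch = FenceDevice()
assert switch.fence("node-a", "node-b")       # node-a's request lands first
assert not switch.fence("node-b", "node-a")   # node-b was already fenced
assert switch.dead == {"node-b"}
```

The model also shows where it breaks down: with no single serialization point (or when, as Lars argues, the CRM should first decide whether fencing is even needed), something else, such as vote weighting, has to break the tie.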
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  From: Daniel Phillips @ 2004-07-08 1:14 UTC
  To: Lars Marowsky-Bree; +Cc: linux-kernel, Lon Hohberger, David Teigland

On Wednesday 07 July 2004 14:16, Lars Marowsky-Bree wrote:
> Yeah, I wish you the best. There's always been quite a bit of code to
> show, but that alone didn't convince people ;-) I've certainly grown
> a bit more experienced / cynical during that time. (Which, according
> to Oscar Wilde, is the same anyway ;)

OK, what I've learned from the discussion so far is, we need to avoid getting stuck too much on the HA aspects and focus more on the cluster/performance side for now. There are just too many entrenched positions on failover. Even though every component of the cluster is designed to fail over, that's just a small part of what we have to deal with:

  - Cluster volume management
  - Cluster configuration management
  - Cluster membership/quorum
  - Node fencing
  - Parallel cluster filesystems with local semantics
  - Distributed locking
  - Cluster mirror block device
  - Cluster snapshot block device
  - Cluster administration interface, including volume management
  - Cluster resource balancing
  - bits I forgot to mention

Out of that, we need to pick the three or four items we're prepared to address immediately, that we can obviously share between at least two known cluster filesystems, and get them onto lkml for peer review.
Trying to push the whole thing as one lump has never worked for
anybody, and won't work in this case either. For example, the DLM is
fairly non-controversial, and important in terms of performance and
reliability. Let's start with that. Furthermore, nobody seems
interested in arguing about the cluster block devices either, so let's
just discuss how they work and get them out of the way. Then let's
tackle the low-level infrastructure, such as CCS (Cluster
Configuration System), which does a simple job: it distributes
configuration files racelessly.

I heard plenty of fascinating discussion of quorum strategies last
night, and have a number of papers to read as a result. But that's a
diversion: it can and must be pluggable. We just need to agree on how
the plugs work, a considerably less ambitious task. In general, the
principle is: the less important it is, the more argument there will
be about it. Defer that, make it pluggable, call it policy, push it to
user space, and move on. We need to agree on the basics so that we can
manage network volumes with cluster filesystems on top of them.

> > I can believe it. What I have just done with my cluster snapshot
> > target over the last couple of weeks is, removed _every_ dependency
> > on cluster infrastructure and moved the one remaining essential
> > interface to user space.
>
> Is there a KS presentation on this? I didn't get invited to KS and
> will just be allowed in for OLS, but I'll be around town already...

There will be a BOF at OLS, "Cluster Infrastructure". Since I didn't
get a KS invite either and what remains is more properly lkml stuff
anyway, I will go canoeing with Matt O'Keefe during KS as planned. We
already did the necessary VFS fixups over the last year (save the
non-critical flock patch, which is now in play) so there is nothing
much left to beg Linus for.
There are additional VFS hooks that would be nice to have for
optimization, but they can wait; people will appreciate them more that
way ;) The non-VFS cluster infrastructure just uses the normal module
API, except for a couple of places in the DM cluster block devices
where I've allowed myself some creative license, easily undone. Again,
this is lkml material, not KS stuff.

> > It looks like fencing is more of an issue, because having several
> > node fencing systems running at the same time in ignorance of each
> > other is deeply wrong. We can't just wave our hands at this by
> > making it pluggable, we need to settle on one that works and use
> > it. I'll humbly suggest that Sistina is furthest along in this
> > regard.
>
> Your fencing system is fine with me; based on the assumption that you
> always have to fence a failed node, you are doing the right thing.
> However, the issues are more subtle when this is no longer true, and
> in a 1:1 how do you arbitrate who is allowed to fence?

Good question. Since two-node clusters are my primary interest at the
moment, I need some answers. I think the current plan is: they try to
fence each other, winner take all. Each node will introspect to decide
if it's in good enough shape to do the job itself, then go try to
fence the other one. Alternatively, they can be configured so that one
has more votes than the other, if somebody wants that broken
arrangement. This is my dim recollection; I'll have more to say when
I've actually hooked my stuff up to it. There are others with plenty
of experience in this, see below.

> > Cluster resource management is the least advanced of the components
> > that our Red Hat Sistina group has to offer, mainly because it is
> > seen as a matter of policy, and so the pressing need at this stage
> > is to provide suitable hooks.
>
> "STOMITH" :)

Yes, exactly.
Global load balancing is another big > > item, i.e., which node gets assigned the job of running a > > particular service, which means you need to know how much of each > > of several different kinds of resources a particular service > > requires, and what the current resource usage profile is for each > > node on the cluster. Rik van Riel is taking a run at this. > > Right, cluster resource management is one of the things where I'm > quite happy with the approach the new heartbeat resource manager is > heading down (or up, I hope ;). Combining heartbeat and resource management sounds like a good idea. Currently, we have them separate and since I have not tried it myself yet, I'll reserve comment. Dave Teigland would be more than happy to wax poetic, though. > > It's a huge, scary problem. We _must_ be able to plug in different > > solutions, all the way from completely manual to completely > > automagic, and we have to be able to handle more than one at once. > > You can plug multiple ones as long as they are managing independent > resources, obviously. However, if the CRM is the one which ultimately > decides whether a node needs to be fenced or not - based on its > knowledge of which resources it owns or could own - this gets a lot > more scary still... We do not see the CRM as being involved in fencing at present, though I can see why perhaps it ought to be. The resource manager that Lon Hohberger is cooking up is scriptable and rule-driven. I'm sure we could spend 100% of the available time on that alone. My strategy is, I send my manually-configurable cluster bits to Lon and he hooks them in so everything is automagic, then I look at how much the end result sucks/doesn't suck. There's some philosophy at work here: I feel that any cluster device that requires elaborate infrastructure and configuration to run is broken. If you can set the cluster devices up manually and they depend only on existing kernel interfaces, they're more likely to get unit testing. 
At the same time, these devices have to fit well into a complex
infrastructure, therefore the manual interface can be driven equally
well by a script or C program, and there is one tiny but crucial
additional hook to allow for automatic reconnection to the cluster if
something bad happens, or if the resource manager just feels the need
to reorganize things.

So while I'm rambling here, I'll mention that the resource manager (or
anybody else) can just summarily cut the block target's pipe and the
block target will politely go ask for a new one. No IOs will be
failed, nothing will break, no suspend needed, just one big breaker
switch to throw. This of course depends on the target using a pipe
(socket) to communicate with the cluster, but even if I do switch to
UDP, I'll still keep at least one pipe around, just because it makes
the target so easy to control.

It didn't start this way. The first prototype had a couple of thousand
lines of glue code to work with various possible infrastructures. Now
that's all gone and there are just two pipes left, one to local user
space for cluster management and the other to somewhere out on the
cluster for synchronization. It's now down to 30% of the original size
and runs faster as a bonus. All cluster interfaces are "read/write",
except for one ioctl to reconnect a broken pipe.

> > Incidentally, there is already a nice cross-section of the cluster
> > community on the way to sunny Minneapolis for the July meeting.
> > We've reached about 50% capacity, and we have quorum, I think :-)
>
> Uhm, do I have to be frightened of being fenced? ;)

Only if you drink too much of that kluster Koolaid.

Regards,

Daniel

^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
2004-07-08 1:14 ` Daniel Phillips
@ 2004-07-08 9:10 ` Lars Marowsky-Bree
2004-07-08 10:53 ` David Teigland
0 siblings, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-08 9:10 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-kernel, Lon Hohberger, David Teigland

On 2004-07-07T21:14:07, Daniel Phillips <phillips@redhat.com> said:

> OK, what I've learned from the discussion so far is, we need to avoid
> getting stuck too much on the HA aspects and focus more on the
> cluster/performance side for now. There are just too many entrenched
> positions on failover.

Well, first, failover is not all of HA. But that's a different
diversion again.

> Out of that, we need to pick the three or four items we're prepared to
> address immediately, that we can obviously share between at least two
> known cluster filesystems, and get them onto lkml for peer review.

Ok.

> For example, the DLM is fairly non-controversial, and important in
> terms of performance and reliability. Let's start with that.

I doubt that assessment; the DLM is going to be somewhat controversial
already, and requires dragging in membership, inter-node messaging,
fencing and quorum. The problem is that you cannot easily separate out
the different pieces.

I'd humbly suggest starting with the changes in the VFS layers which
the CFSs of the different kinds require, regardless of which
infrastructure they use.

Of all the cluster subsystems, the fencing system is likely the most
important. If the various implementations don't step on each other's
toes there, the duplication of membership/messaging/etc is only
inefficient, but not actively harmful.

> I heard plenty of fascinating discussion of quorum strategies last
> night, and have a number of papers to read as a result. But that's a
> diversion: it can and must be pluggable. We just need to agree on how
> the plugs work, a considerably less ambitious task.
When you argue about whether you can mandate quorum for a given
cluster implementation, and about which layers of the cluster are
allowed to require quorum (some will refuse to even tell you the
membership without quorum; some will require quorum before they fence,
others will recover quorum by fencing), the discussion gets fairly
complex. Again, let's see what kernel hooks these require, and defer
all the rest of the discussion as far as possible.

> it policy, push it to user space, and move on. We need to agree on the
> basics so that we can manage network volumes with cluster filesystems
> on top of them.

Ah, that in itself is a very data-centric point of view and not
exactly applicable to the needs of shared-nothing clusters. (I'm not
trying to nitpick, just trying to make you aware of all the hidden
assumptions you may not be aware of yourself.) Of course, this is
perfectly fine for something such as GFS (which, being SAN-based, of
course requires these), but a cluster infrastructure in the kernel may
not be limited to this.

> > Is there a KS presentation on this? I didn't get invited to KS and
> > will just be allowed in for OLS, but I'll be around town already...
> There will be a BOF at OLS, "Cluster Infrastructure". Since I didn't
> get a KS invite either and what remains is more properly lkml stuff
> anyway, I will go canoeing with Matt O'Keefe during KS as planned.

Ah, okay.

> > Your fencing system is fine with me; based on the assumption that you
> > always have to fence a failed node, you are doing the right thing.
> > However, the issues are more subtle when this is no longer true, and
> > in a 1:1 how do you arbitrate who is allowed to fence?
> Good question. Since two-node clusters are my primary interest at the
> moment, I need some answers.

Two-node clusters are reasonably easy, true.

> I think the current plan is: they try to fence each other, winner
> take all.
Each node will introspect to decide if it's in good enough shape > to do the job itself, then go try to fence the other one. Ok, this is essentially what heartbeat does, but it gets more complex with >2 nodes. In which case your cluster block device is going to run into interesting synchronization issues, too, I'd venture. (Or at least drbd does, where we look at replicating across >2 nodes.) > > resources, obviously. However, if the CRM is the one which ultimately > > decides whether a node needs to be fenced or not - based on its > > knowledge of which resources it owns or could own - this gets a lot > > more scary still... > We do not see the CRM as being involved in fencing at present, though I > can see why perhaps it ought to be. The resource manager that Lon > Hohberger is cooking up is scriptable and rule-driven. Frankly, I'm kind of disappointed; why are you cooking up your own once more? When we set out to write a new dependency-based flexible resource manager, we explicitly made it clear that it wasn't just meant to run on top of heartbeat, but in theory on top of any cluster infrastructure. I know this is the course of Open Source development, and that "community project" basically means "my wheel be better than your wheel, and you are allowed to get behind it after we are done, but don't interfere before that", but I'd have expected some discussions or at least solicitation of them on the established public mailing lists, just to keep up the pretense ;-) Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-08 9:10 ` Lars Marowsky-Bree @ 2004-07-08 10:53 ` David Teigland 2004-07-08 14:14 ` Chris Friesen ` (2 more replies) 0 siblings, 3 replies; 55+ messages in thread From: David Teigland @ 2004-07-08 10:53 UTC (permalink / raw) To: linux-kernel; +Cc: Daniel Phillips, Lars Marowsky-Bree On Thu, Jul 08, 2004 at 11:10:43AM +0200, Lars Marowsky-Bree wrote: > Of all the cluster-subsystems, the fencing system is likely the most > important. If the various implementations don't step on eachothers toes > there, the duplication of membership/messaging/etc is only inefficient, > but not actively harmful. I'm afraid the fencing issue has been rather misrepresented. Here's what we're doing (a lot of background is necessary I'm afraid.) We have a symmetric, kernel-based, stand-alone cluster manager (CMAN) that has no ties to anything else whatsoever. It'll simply run and answer the question "who's in the cluster?" by providing a list of names/nodeids. So, if that's all you want you can just run cman on all your nodes and it'll tell you who's in the cluster (kernel and userland api's). CMAN will also do generic callbacks to tell you when the membership has changed. Some people can stop reading here. In the event of network partitions you can obviously have two cman clusters form independently (i.e. "split-brain"). Some people care about this. Quorum is a trivial true/false property of the cluster. Every cluster member has a number of votes and the cluster itself has a number of expected votes. Using these simple values, cman does a quick computation to tell you if the cluster has quorum. It's a very standard way of doing things -- we modelled it directly off the VMS-cluster style. Whether you care about this quorum value or what you do with it are beside the point. Some may be interested in discussing how cman works and participating in further development; if so go ahead and ask on linux-cluster@redhat.com. 
We've been developing and using cman for 3-4 years. Are there other valid approaches? of course. Is cman suitable for many people? yes. Suitable for everyone? no. (see http://sources.redhat.com/cluster/ for patches and mailing list) What about the DLM? The DLM we've developed is again modelled exactly after that in VMS-clusters. It depends on cman for the necessary clustering input. Note that it uses the same generic cman api's as any other system. Again, the DLM is utterly symmetric; there is no server or master node involved. Is this DLM suitable for many people? yes. For everyone? no. (Right now gfs and clvm are the primary dlm users simply because those are the other projects our group works on. DLM is in no way specific to either of those.) What about Fencing? Fencing is not a part of the cluster manager, not a part of the dlm and not a part of gfs. It's an entirely independent system that runs on its own in userland. It depends on cman for cluster information just like the dlm or gfs does. I'll repeat what I said on the linux-cluster mailing list: -- Fencing is a service that runs on its own in a CMAN cluster; it's entirely independent from other services. GFS simply checks to verify fencing is running before allowing a mount since it's especially dangerous for a mount to succeed without it. As soon as a node joins a fencing domain it will be fenced by another domain member if it fails. i.e. as soon as a node runs: > cman_tool join (joins the cluster) > fence_tool join (starts fenced which joins the default fence domain) it will be fenced by another fence domain member if it fails. So, you simply need to configure your nodes to run fence_tool join after joining the cluster if you want fencing to happen. You can add any checks later on that you think are necessary to be sure that the node is in the fence domain. Running fence_tool leave will remove a node cleanly from the fence domain (it won't be fenced by other members.) 
-- This fencing system is suitable for us in our gfs/clvm work. It's probably suitable for others, too. For everyone? no. Can be improved with further development? yes. A central or difficult issue? not really. Again, no need to look at the dlm or gfs or clvm to work with this fencing system. -- Dave Teigland <teigland@redhat.com> ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-08 10:53 ` David Teigland @ 2004-07-08 14:14 ` Chris Friesen 2004-07-08 16:06 ` David Teigland 2004-07-08 18:22 ` Daniel Phillips 2004-07-12 10:14 ` Lars Marowsky-Bree 2 siblings, 1 reply; 55+ messages in thread From: Chris Friesen @ 2004-07-08 14:14 UTC (permalink / raw) To: David Teigland; +Cc: linux-kernel, Daniel Phillips, Lars Marowsky-Bree David Teigland wrote: > I'm afraid the fencing issue has been rather misrepresented. Here's > what we're > doing (a lot of background is necessary I'm afraid.) We have a symmetric, > kernel-based, stand-alone cluster manager (CMAN) that has no ties to > anything > else whatsoever. It'll simply run and answer the question "who's in the > cluster?" by providing a list of names/nodeids. > > So, if that's all you want you can just run cman on all your nodes and > it'll > tell you who's in the cluster (kernel and userland api's). CMAN will > also do > generic callbacks to tell you when the membership has changed. Some > people can > stop reading here. I'm curious--this seems to be exactly what the cluster membership portion of the SAF spec provides. Would it make sense to contribute to that portion of OpenAIS, then export the CMAN API on top of it for backwards compatibility? It just seems like there are a bunch of different cluster messaging, membership, etc. systems, and there is a lot of work being done in parallel with different implementations of the same functionality. Now that there is a standard emerging for clustering (good or bad, we've got people asking for it) would it make sense to try and get behind that standard and try and make a reference implementation? You guys are more experienced than I, but it seems a bit of a waste to see all these projects re-inventing the wheel. Chris ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
2004-07-08 14:14 ` Chris Friesen
@ 2004-07-08 16:06 ` David Teigland
0 siblings, 0 replies; 55+ messages in thread
From: David Teigland @ 2004-07-08 16:06 UTC (permalink / raw)
To: Chris Friesen; +Cc: linux-kernel, Daniel Phillips, Lars Marowsky-Bree

> > I'm afraid the fencing issue has been rather misrepresented. Here's
> > what we're doing (a lot of background is necessary I'm afraid.) We
> > have a symmetric, kernel-based, stand-alone cluster manager (CMAN)
> > that has no ties to anything else whatsoever. It'll simply run and
> > answer the question "who's in the cluster?" by providing a list of
> > names/nodeids.
> >
> > So, if that's all you want you can just run cman on all your nodes
> > and it'll tell you who's in the cluster (kernel and userland APIs).
> > CMAN will also do generic callbacks to tell you when the membership
> > has changed. Some people can stop reading here.
>
> I'm curious--this seems to be exactly what the cluster membership
> portion of the SAF spec provides. Would it make sense to contribute to
> that portion of OpenAIS, then export the CMAN API on top of it for
> backwards compatibility?

That's definitely worth investigating. If the SAF API is only of
interest in userland, then perhaps a library can translate between the
SAF API and the existing interface cman exports to userland. We'd
welcome efforts to make cman itself more compatible with SAF, too.
We're not very familiar with it, though.

> It just seems like there are a bunch of different cluster messaging,
> membership, etc. systems, and there is a lot of work being done in
> parallel with different implementations of the same functionality. Now
> that there is a standard emerging for clustering (good or bad, we've
> got people asking for it) would it make sense to try and get behind
> that standard and try and make a reference implementation?
> > You guys are more experienced than I, but it seems a bit of a waste to see > all these projects re-inventing the wheel. Sure, we're happy to help make this code more useful to others. We wrote this for a very immediate and practical reason of course -- to support gfs, clvm, dlm, etc, but always expected it would be used more broadly. We've not done a lot of work with it lately since as I mentioned it was begun years ago. -- Dave Teigland <teigland@redhat.com> ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-08 10:53 ` David Teigland 2004-07-08 14:14 ` Chris Friesen @ 2004-07-08 18:22 ` Daniel Phillips 2004-07-08 19:41 ` Steven Dake 2004-07-12 10:14 ` Lars Marowsky-Bree 2 siblings, 1 reply; 55+ messages in thread From: Daniel Phillips @ 2004-07-08 18:22 UTC (permalink / raw) To: David Teigland; +Cc: linux-kernel, Lars Marowsky-Bree Hi Dave, On Thursday 08 July 2004 06:53, David Teigland wrote: > We have a symmetric, kernel-based, stand-alone cluster manager (CMAN) > that has no ties to anything else whatsoever. It'll simply run and > answer the question "who's in the cluster?" by providing a list of > names/nodeids. While we're in here, could you please explain why CMAN needs to be kernel-based? (Just thought I'd broach the question before Christoph does.) Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
2004-07-08 18:22 ` Daniel Phillips
@ 2004-07-08 19:41 ` Steven Dake
2004-07-10 4:58 ` David Teigland
2004-07-10 4:58 ` Daniel Phillips
0 siblings, 2 replies; 55+ messages in thread
From: Steven Dake @ 2004-07-08 19:41 UTC (permalink / raw)
To: Daniel Phillips; +Cc: David Teigland, linux-kernel, Lars Marowsky-Bree

On Thu, 2004-07-08 at 11:22, Daniel Phillips wrote:
> Hi Dave,
>
> On Thursday 08 July 2004 06:53, David Teigland wrote:
> > We have a symmetric, kernel-based, stand-alone cluster manager (CMAN)
> > that has no ties to anything else whatsoever. It'll simply run and
> > answer the question "who's in the cluster?" by providing a list of
> > names/nodeids.
>
> While we're in here, could you please explain why CMAN needs to be
> kernel-based? (Just thought I'd broach the question before Christoph
> does.)

Daniel,

I have that same question as well. I can think of several
disadvantages:

1) security faults in the protocol can crash the kernel or violate
   system security
2) secure group communication is difficult to implement in kernel -
   secure group key protocols can be implemented fairly easily in
   userspace using packages like OpenSSL. Implementing these protocols
   in kernel will prove to be very complex.
3) live upgrades are much more difficult with kernel components
4) a standard interface (the SA Forum AIS) is not being used,
   disallowing replaceability of components. This is a big deal for
   people interested in clustering that don't want to be locked into a
   particular implementation.
5) dlm, fencing, cluster messaging (including membership) can be done
   in userspace, so why not do it there?
6) cluster services for the kernel and cluster services for
   applications will fork, because SA Forum AIS will be chosen for
   application-level services.
7) faults in the protocols can bring down all of Linux, instead of one
   cluster service on one node.
8) kernel changes require much longer to get into the field and are
   much more difficult to distribute. Userspace applications are much
   simpler to unit test, qualify, and release.

The advantages are:
- interrupt-driven timers
- some possible reduction in latency related to the cost of executing
  a system call when sending messages (including lock messages)

I would like to share with you the efforts of the industry standards
body Service Availability Forum (www.saforum.org). The Forum is
interested in specifying interfaces for improving availability of a
system. One of the collections of APIs (called the Application
Interface Specification) utilizes redundant software components using
clustering approaches to improve availability. The AIS specification
specifies APIs for cluster membership, application failover,
checkpointing, eventing, messaging, and distributed locks. All of
these services are designed to work with multiple nodes.

It would be beneficial to everyone to adopt these standard interfaces.
A lot of thought has gone into them. They are pretty solid. And there
are at least two open source implementations under way (openais and
linux-ha) and more on the horizon.

One of these projects, the openais project which I maintain,
implements 3 of these services (and the rest will be done in the
timeframes we are talking about) in user space without any kernel
changes required. It would be possible, with kernel to userland
communication, for the cluster applications (GFS, distributed block
device, etc.) to use this standard interface and implementation. Then
we could avoid all of the unnecessary kernel maintenance and potential
problems that come along with it.

Are you interested in such an approach?
Thanks
-steve

^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-08 19:41 ` Steven Dake @ 2004-07-10 4:58 ` David Teigland 2004-07-10 4:58 ` Daniel Phillips 1 sibling, 0 replies; 55+ messages in thread From: David Teigland @ 2004-07-10 4:58 UTC (permalink / raw) To: linux-kernel; +Cc: Steven Dake, Daniel Phillips, Lars Marowsky-Bree On Thu, Jul 08, 2004 at 12:41:21PM -0700, Steven Dake wrote: > On Thu, 2004-07-08 at 11:22, Daniel Phillips wrote: > > Hi Dave, > > > > On Thursday 08 July 2004 06:53, David Teigland wrote: > > > We have a symmetric, kernel-based, stand-alone cluster manager (CMAN) > > > that has no ties to anything else whatsoever. It'll simply run and > > > answer the question "who's in the cluster?" by providing a list of > > > names/nodeids. > > > > While we're in here, could you please explain why CMAN needs to be > > kernel-based? (Just thought I'd broach the question before Christoph > > does.) > > I have that same question as well. gfs needs to run in the kernel. dlm should run in the kernel since gfs uses it so heavily. cman is the clustering subsystem on top of which both of those are built and on which both depend quite critically. It simply makes most sense to put cman in the kernel for what we're doing with it. That's not a dogmatic position, just a practical one based on our experience. > I can think of several disadvantages: > > 1) security faults in the protocol can crash the kernel or violate > system security > 2) secure group communication is difficult to implement in kernel > - secure group key protocols can be implemented fairly easily in > userspace using packages like openssl. Implementing these > protocols in kernel will prove to be very complex. > 3) live upgrades are much more difficult with kernel components > 4) a standard interface (the SA Forum AIS) is not being used, > disallowing replaceability of components. 
> This is a big deal for people interested in clustering that don't
> want to be locked into a particular implementation.
> 5) dlm, fencing, cluster messaging (including membership) can be done
>    in userspace, so why not do it there?
> 6) cluster services for the kernel and cluster services for
>    applications will fork, because SA Forum AIS will be chosen for
>    application-level services.
> 7) faults in the protocols can bring down all of Linux, instead of
>    one cluster service on one node.
> 8) kernel changes require much longer to get into the field and are
>    much more difficult to distribute. Userspace applications are much
>    simpler to unit test, qualify, and release.
>
> The advantages are:
> interrupt-driven timers
> some possible reduction in latency related to the cost of executing a
> system call when sending messages (including lock messages)

This view of advantages/disadvantages seems sensible when working with
your average userland clustering application. The SAF spec looks
pretty nice in that context. I think gfs and a kernel-based dlm for
gfs are a different story, though. They're different enough from other
things that few of the same considerations seem practical. This has
been our experience so far; things could possibly change for some
next generation (think a time span of years).

You'll note that gfs uses external, interchangeable locking/cluster
systems, which makes it easy to look at alternatives. cman and dlm are
what gfs/clvm use today; if they prove useful to others that's great,
we'd even be happy to help make them more useful.

-- 
Dave Teigland <teigland@redhat.com>

^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-08 19:41 ` Steven Dake 2004-07-10 4:58 ` David Teigland @ 2004-07-10 4:58 ` Daniel Phillips 2004-07-10 17:59 ` Steven Dake 1 sibling, 1 reply; 55+ messages in thread From: Daniel Phillips @ 2004-07-10 4:58 UTC (permalink / raw) To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree Hi Steven, On Thursday 08 July 2004 15:41, Steven Dake wrote: > On Thu, 2004-07-08 at 11:22, Daniel Phillips wrote: > > While we're in here, could you please explain why CMAN needs to be > > kernel-based? (Just thought I'd broach the question before Christoph > > does.) > > Daniel, > > I have that same question as well. I can think of several > disadvantages: > > 1) security faults in the protocol can crash the kernel or violate > system security > 2) secure group communication is difficult to implement in kernel > - secure group key protocols can be implemented fairly easily in > userspace using packages like openssl. Implementing these > protocols in kernel will prove to be very complex. > 3) live upgrades are much more difficult with kernel components > 4) a standard interface (the SA Forum AIS) is not being used, > disallowing replaceability of components. This is a big deal for > people interested in clustering that dont want to be locked into > a partciular implementation. > 5) dlm, fencing, cluster messaging (including membership) can be done > in userspace, so why not do it there. > 6) cluster services for the kernel and cluster services for applications > will fork, because SA Forum AIS will be chosen for application > level services. > 7) faults in the protocols can bring down all of Linux, instead of one > cluster service on one node. > 8) kernel changes require much longer to get into the field and are > much more difficult to distribute. userspace applications are much > simpler to unit test, qualify, and release. 
> > The advantages are: > interrupt driven timers > some possible reduction in latency related to the cost of executing a > system call when sending messages (including lock messages) I'm not saying you're wrong, but I can think of an advantage you didn't mention: a service living in kernel will inherit the PF_MEMALLOC state of the process that called it, that is, a VM cache flushing task. A userspace service will not. A cluster block device in kernel may need to invoke some service in userspace at an inconvenient time. For example, suppose somebody spills coffee into a network node while another network node is in PF_MEMALLOC state, busily trying to write out dirty file data to it. The kernel block device now needs to yell to the user space service to go get it a new network connection. But the userspace service may need to allocate some memory to do that, and, whoops, the kernel won't give it any because it is in PF_MEMALLOC state. Now what? > One of these projects, the openais project which I maintain, implements > 3 of these services (and the rest will be done in the timeframes we are > talking about) in user space without any kernel changes required. It > would be possible with kernel to userland communication for the cluster > applications (GFS, distributed block device, etc) to use this standard > interface and implementation. Then we could avoid all of the > unnecessary kernel maintenance and potential problems that come along > with it. > > Are you interested in such an approach? We'd be remiss not to be aware of it, and its advantages. It seems your project is still in early stages. How about we take pains to ensure that your cluster membership service is pluggable into the CMAN infrastructure, as a starting point. Though I admit I haven't read through the whole code tree, there doesn't seem to be a distributed lock manager there. Maybe that is because it's so tightly coded I missed it? 
Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
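The PF_MEMALLOC argument Daniel makes above can be captured in a small userspace model. This is an illustrative sketch only, not the real kernel allocator: a task flagged PF_MEMALLOC (the VM writeout path) may dip into an emergency reserve, while an ordinary task, including any userspace cluster daemon, may not, so under memory pressure its allocations simply fail. The structures and names here are hypothetical.

```c
#include <stdbool.h>

/* Toy model of PF_MEMALLOC gating -- not kernel code, just the shape
 * of the argument. */

struct mem_state {
    int free_pages;      /* ordinary free memory */
    int reserve_pages;   /* emergency reserve, PF_MEMALLOC tasks only */
};

struct task {
    bool pf_memalloc;    /* set while flushing dirty pages for the VM */
};

/* Returns true if one page could be allocated for the task. */
bool alloc_page(struct mem_state *m, const struct task *t)
{
    if (m->free_pages > 0) {
        m->free_pages--;
        return true;
    }
    if (t->pf_memalloc && m->reserve_pages > 0) {
        m->reserve_pages--;  /* writeout path taps the reserve */
        return true;
    }
    return false;            /* ordinary task: allocation fails */
}
```

Once `free_pages` hits zero, the in-kernel writeout task still makes progress from the reserve, but the userspace helper it depends on gets nothing, which is exactly the coffee-spill deadlock described above.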
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-10 4:58 ` Daniel Phillips @ 2004-07-10 17:59 ` Steven Dake 2004-07-10 20:57 ` Daniel Phillips 0 siblings, 1 reply; 55+ messages in thread From: Steven Dake @ 2004-07-10 17:59 UTC (permalink / raw) To: Daniel Phillips Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree Comments inline, thanks -steve On Fri, 2004-07-09 at 21:58, Daniel Phillips wrote: > Hi Steven, > > On Thursday 08 July 2004 15:41, Steven Dake wrote: > > On Thu, 2004-07-08 at 11:22, Daniel Phillips wrote: > > > While we're in here, could you please explain why CMAN needs to be > > > kernel-based? (Just thought I'd broach the question before Christoph > > > does.) > > > > Daniel, > > > > I have that same question as well. I can think of several > > disadvantages: > > > > 1) security faults in the protocol can crash the kernel or violate > > system security > > 2) secure group communication is difficult to implement in kernel > > - secure group key protocols can be implemented fairly easily in > > userspace using packages like openssl. Implementing these > > protocols in kernel will prove to be very complex. > > 3) live upgrades are much more difficult with kernel components > > 4) a standard interface (the SA Forum AIS) is not being used, > > disallowing replaceability of components. This is a big deal for > > people interested in clustering that don't want to be locked into > > a particular implementation. > > 5) dlm, fencing, cluster messaging (including membership) can be done > > in userspace, so why not do it there. > > 6) cluster services for the kernel and cluster services for applications > > will fork, because SA Forum AIS will be chosen for application > > level services. > > 7) faults in the protocols can bring down all of Linux, instead of one > > cluster service on one node. > > 8) kernel changes require much longer to get into the field and are > > much more difficult to distribute. 
userspace applications are much > > simpler to unit test, qualify, and release. > > > > The advantages are: > > interrupt driven timers > > some possible reduction in latency related to the cost of executing a > > system call when sending messages (including lock messages) > > I'm not saying you're wrong, but I can think of an advantage you didn't > mention: a service living in kernel will inherit the PF_MEMALLOC state of the > process that called it, that is, a VM cache flushing task. A userspace > service will not. A cluster block device in kernel may need to invoke some > service in userspace at an inconvenient time. > > For example, suppose somebody spills coffee into a network node while another > network node is in PF_MEMALLOC state, busily trying to write out dirty file > data to it. The kernel block device now needs to yell to the user space > service to go get it a new network connection. But the userspace service may > need to allocate some memory to do that, and, whoops, the kernel won't give > it any because it is in PF_MEMALLOC state. Now what? > overload conditions that have caused the kernel to run low on memory are a difficult problem, even for kernel components. Currently openais includes "memory pools" which preallocate data structures. While that work is not yet complete, the intent is to ensure every data area is preallocated so the openais executive (the thing that does all of the work) doesn't ever request extra memory once it becomes operational. This of course, leads to problems in the following system calls which openais uses extensively: sys_poll sys_recvmsg sys_sendmsg which require the allocations of memory with GFP_KERNEL, which can then fail returning ENOMEM to userland. The openais protocol currently can handle low memory failures in recvmsg and sendmsg. This is because it uses a protocol designed to operate on lossy networks. 
The poll system call problem will be rectified by utilizing sys_epoll_wait which does not allocate any memory (the poll data is preallocated). I hope that helps at least answer that some R&D is underway to solve this particular overload problem in userspace. > > One of these projects, the openais project which I maintain, implements > > 3 of these services (and the rest will be done in the timeframes we are > > talking about) in user space without any kernel changes required. It > > would be possible with kernel to userland communication for the cluster > > applications (GFS, distributed block device, etc) to use this standard > > interface and implementation. Then we could avoid all of the > > unnecessary kernel maintenance and potential problems that come along > > with it. > > > > Are you interested in such an approach? > > We'd be remiss not to be aware of it, and its advantages. It seems your > project is still in early stages. How about we take pains to ensure that > your cluster membership service is pluggable into the CMAN infrastructure, as > a starting point. > sounds good > Though I admit I haven't read through the whole code tree, there doesn't seem > to be a distributed lock manager there. Maybe that is because it's so > tightly coded I missed it? > There is as of yet no implementation of the SAF AIS dlock API in openais. The work requires about 4 weeks of development for someone well-skilled. I'd expect a contribution for this API in the timeframes that make GFS interesting. I'd invite you, or others interested in these sorts of services, to contribute that code, if interested. If interested in developing such a service for openais, check out the developer's map (which describes developing a service for openais) at: http://developer.osdl.org/dev/openais/src/README.devmap Thanks! -steve > Regards, > > Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
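The epoll point Steven makes, that the ready-event buffer lives in memory the caller allocated up front, so the wait itself needs no fresh per-call table the way poll() does, can be sketched as a minimal Linux-only demo on a pipe (error handling trimmed; the function name is ours, not openais code):

```c
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>

#define MAX_EVENTS 16
static struct epoll_event ready[MAX_EVENTS];  /* preallocated once */

/* Register a pipe read end, make it ready, and return the number of
 * ready descriptors reported by epoll_wait. */
int demo_epoll_ready(void)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    int epfd = epoll_create(1);  /* size hint, ignored by modern kernels */
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    (void)write(pipefd[1], "x", 1);  /* make the read end readable */

    /* The wait fills the caller's preallocated array. */
    int n = epoll_wait(epfd, ready, MAX_EVENTS, 1000);

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

The interest set is registered once with epoll_ctl and kept kernel-side, so a long-lived daemon pays the setup cost up front rather than on every wait.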
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-10 17:59 ` Steven Dake @ 2004-07-10 20:57 ` Daniel Phillips 2004-07-10 23:24 ` Steven Dake 0 siblings, 1 reply; 55+ messages in thread From: Daniel Phillips @ 2004-07-10 20:57 UTC (permalink / raw) To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree On Saturday 10 July 2004 13:59, Steven Dake wrote: > > I'm not saying you're wrong, but I can think of an advantage you > > didn't mention: a service living in kernel will inherit the > > PF_MEMALLOC state of the process that called it, that is, a VM > > cache flushing task. A userspace service will not. A cluster > > block device in kernel may need to invoke some service in userspace > > at an inconvenient time. > > > > For example, suppose somebody spills coffee into a network node > > while another network node is in PF_MEMALLOC state, busily trying > > to write out dirty file data to it. The kernel block device now > > needs to yell to the user space service to go get it a new network > > connection. But the userspace service may need to allocate some > > memory to do that, and, whoops, the kernel won't give it any > > because it is in PF_MEMALLOC state. Now what? > > overload conditions that have caused the kernel to run low on memory > are a difficult problem, even for kernel components. Currently > openais includes "memory pools" which preallocate data structures. > While that work is not yet complete, the intent is to ensure every > data area is preallocated so the openais executive (the thing that > does all of the work) doesn't ever request extra memory once it > becomes operational. > > This of course, leads to problems in the following system calls which > openais uses extensively: > sys_poll > sys_recvmsg > sys_sendmsg > > which require the allocations of memory with GFP_KERNEL, which can > then fail returning ENOMEM to userland. 
The openais protocol > currently can handle low memory failures in recvmsg and sendmsg. > This is because it uses a protocol designed to operate on lossy > networks. > > The poll system call problem will be rectified by utilizing > sys_epoll_wait which does not allocate any memory (the poll data is > preallocated). But if the user space service is sitting in the kernel's dirty memory writeout path, you have a real problem: the low memory condition may never get resolved, rendering your userspace service autistic. Meanwhile, whoever is generating the dirty memory just keeps spinning and spinning, generating more of it, ensuring that if the system does survive the first incident, there's another, worse traffic jam coming down the pipe. To trigger this deadlock, a kernel filesystem or block device module just has to lose its cluster connection(s) at the wrong time. > I hope that helps atleast answer that some r&d is underway to solve > this particular overload problem in userspace. I'm certain there's a solution, but until it is demonstrated and proved, any userspace cluster services must be regarded with narrow squinty eyes. > > Though I admit I haven't read through the whole code tree, there > > doesn't seem to be a distributed lock manager there. Maybe that is > > because it's so tightly coded I missed it? > > There is as of yet no implementation of the SAF AIS dlock API in > openais. The work requires about 4 weeks of development for someone > well-skilled. I'd expect a contribution for this API in the > timeframes that make GFS interesting. I suspect you have underestimated the amount of development time required. > I'd invite you, or others interested in these sorts of services, to > contribute that code, if interested. Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see if you can hack it to do what you want. Just write a kernel module that exports the DLM interface to userspace in the desired form. 
http://sources.redhat.com/cluster/dlm/ Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-10 20:57 ` Daniel Phillips @ 2004-07-10 23:24 ` Steven Dake 2004-07-11 19:44 ` Daniel Phillips 0 siblings, 1 reply; 55+ messages in thread From: Steven Dake @ 2004-07-10 23:24 UTC (permalink / raw) To: Daniel Phillips Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree some comments inline On Sat, 2004-07-10 at 13:57, Daniel Phillips wrote: > On Saturday 10 July 2004 13:59, Steven Dake wrote: > > > I'm not saying you're wrong, but I can think of an advantage you > > > didn't mention: a service living in kernel will inherit the > > > PF_MEMALLOC state of the process that called it, that is, a VM > > > cache flushing task. A userspace service will not. A cluster > > > block device in kernel may need to invoke some service in userspace > > > at an inconvenient time. > > > > > > For example, suppose somebody spills coffee into a network node > > > while another network node is in PF_MEMALLOC state, busily trying > > > to write out dirty file data to it. The kernel block device now > > > needs to yell to the user space service to go get it a new network > > > connection. But the userspace service may need to allocate some > > > memory to do that, and, whoops, the kernel won't give it any > > > because it is in PF_MEMALLOC state. Now what? > > > > overload conditions that have caused the kernel to run low on memory > > are a difficult problem, even for kernel components. Currently > > openais includes "memory pools" which preallocate data structures. > > While that work is not yet complete, the intent is to ensure every > > data area is preallocated so the openais executive (the thing that > > does all of the work) doesn't ever request extra memory once it > > becomes operational. 
> > > > This of course, leads to problems in the following system calls which > > openais uses extensively: > > sys_poll > > sys_recvmsg > > sys_sendmsg > > > > which require the allocations of memory with GFP_KERNEL, which can > > then fail returning ENOMEM to userland. The openais protocol > > currently can handle low memory failures in recvmsg and sendmsg. > > This is because it uses a protocol designed to operate on lossy > > networks. > > > > The poll system call problem will be rectified by utilizing > > sys_epoll_wait which does not allocate any memory (the poll data is > > preallocated). > > But if the user space service is sitting in the kernel's dirty memory > writeout path, you have a real problem: the low memory condition may > never get resolved, rendering your userspace service autistic. > Meanwhile, whoever is generating the dirty memory just keeps spinning > and spinning, generating more of it, ensuring that if the system does > survive the first incident, there's another, worse traffic jam coming > down the pipe. To trigger this deadlock, a kernel filesystem or block > device module just has to lose its cluster connection(s) at the wrong > time. > > > I hope that helps atleast answer that some r&d is underway to solve > > this particular overload problem in userspace. > > I'm certain there's a solution, but until it is demonstrated and proved, > any userspace cluster services must be regarded with narrow squinty > eyes. > I agree that a solution must be demonstrated and proved. There is another option, which I regularly recommend to anyone that must deal with memory overload conditions. Don't size the applications in such a way as to ever cause memory overload. This practical approach requires just a little more thought on application deployment with the benefit of avoiding the various and many problems with memory overload that leads to application faults, OS faults, and other sorts of nasty conditions. 
> > > Though I admit I haven't read through the whole code tree, there > > > doesn't seem to be a distributed lock manager there. Maybe that is > > > because it's so tightly coded I missed it? > > > > There is as of yet no implementation of the SAF AIS dlock API in > > openais. The work requires about 4 weeks of development for someone > > well-skilled. I'd expect a contribution for this API in the > > timeframes that make GFS interesting. > > I suspect you have underestimated the amount of development time > required. > The checkpointing api took approx 3 weeks to develop and has many more functions to implement. Cluster membership took approx 1 week to develop. The AMF which provides application failover, the most complicated of the APIs, took approx 8 weeks to develop. The group messaging protocol (which implements the virtual synchrony model) has consumed 80% of the development time thus far. So 4 weeks is reasonable for someone not familiar with the openais architecture or SA Forum specification, since the virtual synchrony group messaging protocol is complete enough to implement a lock service with simple messaging without any race conditions even during network partitions and merges. > > I'd invite you, or others interested in these sorts of services, to > > contribute that code, if interested. > > Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see > if you can hack it to do what you want. Just write a kernel module > that exports the DLM interface to userspace in the desired form. > > http://sources.redhat.com/cluster/dlm/ > I would rather avoid non-mainline kernel dependencies at this time as it makes adoption difficult until kernel patches are merged into upstream code. Who wants to patch their kernel to try out some APIs? I am doubtful these sort of kernel patches will be merged without a strong argument of why it absolutely must be implemented in the kernel vs all of the counter arguments against a kernel implementation. 
There is one more advantage to group messaging and distributed locking implemented within the kernel, that I hadn't originally considered; it sure is sexy. Regards -steve > Regards, > > Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-10 23:24 ` Steven Dake @ 2004-07-11 19:44 ` Daniel Phillips 2004-07-11 21:06 ` Lars Marowsky-Bree 2004-07-12 4:08 ` Steven Dake 0 siblings, 2 replies; 55+ messages in thread From: Daniel Phillips @ 2004-07-11 19:44 UTC (permalink / raw) To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree On Saturday 10 July 2004 19:24, Steven Dake wrote: > On Sat, 2004-07-10 at 13:57, Daniel Phillips wrote: > > On Saturday 10 July 2004 13:59, Steven Dake wrote: > > > overload conditions that have caused the kernel to run low on memory > > > are a difficult problem, even for kernel components... > > > ...I hope that helps atleast answer that some r&d is underway to solve > > > this particular overload problem in userspace. > > > > I'm certain there's a solution, but until it is demonstrated and proved, > > any userspace cluster services must be regarded with narrow squinty > > eyes. > > I agree that a solution must be demonstrated and proved. > > There is another option, which I regularly recommend to anyone that > must deal with memory overload conditions. Don't size the applications > in such a way as to ever cause memory overload. That, and "just add more memory" are the two common mistakes people make when thinking about this problem. The kernel _normally_ runs near the low-memory barrier, on the theory that caching as much as possible is a good thing. Unless you can prove that your userspace approach never deadlocks, the other questions don't even move the needle. I am sure that one day somebody, maybe you, will demonstrate a userspace approach that is provably correct. Until then, if you want your cluster to stay up and fail over properly, there's only one game in town. 
We need to worry about ensuring that no API _depends_ on the cluster manager being in-kernel, and we also need to seek out and excise any parts that could possibly be moved out to user space without enabling the deadlock or grossly messing up the kernel code. > > > I'd invite you, or others interested in these sorts of services, to > > > contribute that code, if interested. > > > > Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see > > if you can hack it to do what you want. Just write a kernel module > > that exports the DLM interface to userspace in the desired form. > > > > http://sources.redhat.com/cluster/dlm/ > > I would rather avoid non-mainline kernel dependencies at this time as it > makes adoption difficult until kernel patches are merged into upstream > code. Who wants to patch their kernel to try out some APIs? Everybody working on clusters. It's a fact of life that you have to apply patches to run cluster filesystems right now. Production will be a different story, but (except for the stable GFS code on 2.4) nobody is close to that. > I am doubtful these sort of kernel patches will be merged without a strong > argument of why it absolutely must be implemented in the kernel vs all > of the counter arguments against a kernel implementation. True. Do you agree that the PF_MEMALLOC argument is a strong one? > There is one more advantage to group messaging and distributed locking > implemented within the kernel, that I hadn't originally considered; it > sure is sexy. I don't think it's sexy, I think it's ugly, to tell the truth. I am actively researching how to move the slow-path cluster infrastructure out of kernel, and I would be pleased to work together with anyone else who is interested in this nasty problem. Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-11 19:44 ` Daniel Phillips @ 2004-07-11 21:06 ` Lars Marowsky-Bree 2004-07-12 6:58 ` Arjan van de Ven 2004-07-12 4:08 ` Steven Dake 1 sibling, 1 reply; 55+ messages in thread From: Lars Marowsky-Bree @ 2004-07-11 21:06 UTC (permalink / raw) To: Daniel Phillips, sdake; +Cc: David Teigland, linux-kernel On 2004-07-11T15:44:25, Daniel Phillips <phillips@istop.com> said: > Unless you can prove that your userspace approach never deadlocks, the other > questions don't even move the needle. I am sure that one day somebody, maybe > you, will demonstrate a userspace approach that is provably correct. If you can _prove_ your kernel-space implementation to be correct, I'll drop all and every single complaint ;) > Until then, if you want your cluster to stay up and fail over > properly, there's only one game in town. This however is not true; clusters have managed just fine running in user-space (realtime priority, mlocked into (pre-allocated) memory etc). I agree that for a cluster filesystem it's much lower latency to have the infrastructure in the kernel. Going back and forth to user-land just ain't as fast and also not very neat. However, the memory argument is pretty weak; the memory for heartbeating and core functionality must be pre-allocated if you care that much. And if you cannot allocate it, maybe you ain't healthy enough to join the cluster in the first place. Otherwise, I don't much care about whether it's in-kernel or not. My main argument against being in the kernel space has always been portability and ease of integration, which makes this quite annoying for ISVs, and the support issues which arise. But if it's however a common component part of the 'kernel proper', then this argument no longer holds. If the infrastructure takes that jump, I'd be happy. 
Infrastructure is boring and has been solved/reinvented so often that there's hardly anything new and exciting about heartbeating and membership; there's more fun work higher up the stack. > > There is one more advantage to group messaging and distributed > > locking implemented within the kernel, that I hadn't originally > > considered; it sure is sexy. > I don't think it's sexy, I think it's ugly, to tell the truth. I am > actively researching how to move the slow-path cluster infrastructure > out of kernel, and I would be pleased to work together with anyone > else who is interested in this nasty problem. Messaging (which hopefully includes strong authentication if not encryption, though I could see that being delegated to IPsec) and locking are in the fast path, though. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
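The user-space hardening Lars alludes to (realtime priority, mlocked into pre-allocated memory) looks roughly like the following sketch. This is illustrative only, not heartbeat's actual code; both calls need privileges, and a real daemon would treat failure as "not healthy enough to join the cluster":

```c
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

/* Pin all current and future pages so the heartbeat path cannot be
 * paged out, then request soft-realtime scheduling.
 * Returns 0 on success, or the errno of the first failing step
 * (typically EPERM when run unprivileged). */
int harden_daemon(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
        return errno;

    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 1;              /* modest RT priority */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
        munlockall();                   /* undo the page pinning */
        return errno;
    }
    return 0;
}
```

As Arjan points out in the next message, this is necessary but not sufficient: mlock protects the process's own pages, not the kernel-side allocations its syscalls make.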
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-11 21:06 ` Lars Marowsky-Bree @ 2004-07-12 6:58 ` Arjan van de Ven 2004-07-12 10:05 ` Lars Marowsky-Bree 0 siblings, 1 reply; 55+ messages in thread From: Arjan van de Ven @ 2004-07-12 6:58 UTC (permalink / raw) To: Lars Marowsky-Bree; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel [-- Attachment #1: Type: text/plain, Size: 793 bytes --] > > This however is not true; clusters have managed just fine running in > user-space (realtime priority, mlocked into (pre-allocated) memory > etc). (ignoring the entire context and argument) Running realtime and mlocked (prealloced) is most certainly not sufficient for cases like this; any system call that internally allocates memory (even if it's just for allocating the kernel side of the filename you hand to open) can lead this RT, mlocked process to cause VM writeout elsewhere. While I can't say how this affects your argument, everyone should be really careful with the "just mlock it" argument because it just doesn't help the worst case in scenarios like this. (It most obviously helps the average case, so for soft-realtime use it's a good approach) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 6:58 ` Arjan van de Ven @ 2004-07-12 10:05 ` Lars Marowsky-Bree 2004-07-12 10:11 ` Arjan van de Ven 0 siblings, 1 reply; 55+ messages in thread From: Lars Marowsky-Bree @ 2004-07-12 10:05 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel On 2004-07-12T08:58:46, Arjan van de Ven <arjanv@redhat.com> said: > Running realtime and mlocked (prealloced) is most certainly not > sufficient for cases like this; any system call that internally > allocates memory (even if it's just for allocating the kernel side of > the filename you hand to open) can lead this RT, mlocked process to > cause VM writeout elsewhere. Of course; appropriate safety measures - like not doing any syscall which could potentially block, or isolating them from the main task via double-buffering children - need to be done. (heartbeat does this, in fact.) Again, if we have "many" in-kernel users requiring high performance & low latency, running in the kernel may not be as bad, but I still don't entirely like it. But user-space can also manage just fine, and instead of continuing the "we need highperf, low-latency and non-blocking so it must be in the kernel", we may want to consider how to have high-perf low-latency kernel/user-space communication so that we can NOT move this into the kernel. Suffice to say that many user-space implementations exist which satisfy these needs quite sufficiently; in the case of a CFS, this argument may be different, but I'd like to see some hard data to back it up. (On a practical note, a system which drops out of membership because allocating a 256 byte buffer for a filename takes longer than the node deadtime (due to high load) is reasonably unlikely to be a healthy cluster member anyway and is on its road to eviction already.) 
The main reason why I'd like to see cluster infrastructure in the kernel is not technical, but because it increases the pressure on unification so much that people might actually get their act together this time ;-) Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 10:05 ` Lars Marowsky-Bree @ 2004-07-12 10:11 ` Arjan van de Ven 2004-07-12 10:21 ` Lars Marowsky-Bree 0 siblings, 1 reply; 55+ messages in thread From: Arjan van de Ven @ 2004-07-12 10:11 UTC (permalink / raw) To: Lars Marowsky-Bree; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1173 bytes --] On Mon, Jul 12, 2004 at 12:05:47PM +0200, Lars Marowsky-Bree wrote: > On 2004-07-12T08:58:46, > Arjan van de Ven <arjanv@redhat.com> said: > > > Running realtime and mlocked (prealloced) is most certainly not > > sufficient for cases like this; any system call that internally > > allocates memory (even if it's just for allocating the kernel side of > > the filename you hand to open) can lead this RT, mlocked process to > > cause VM writeout elsewhere. > > Of course; appropriate safety measures - like not doing any syscall > which could potentially block, or isolating them from the main task via > double-buffering children - need to be done. (heartbeat does this in > fact.) well the problem is that you cannot really prevent a syscall from blocking. O_NONBLOCK only impacts the waiting for IO/socket buffer space to not do so (in general), it doesn't impact the memory allocation strategies by syscalls. And there's a whopping lot of that in the non-boring syscalls... So while your heartbeat process won't block during getpid, it'll eventually need to do real work too... and I'm quite certain that will lead down to GFP_KERNEL memory allocations. [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 10:11 ` Arjan van de Ven @ 2004-07-12 10:21 ` Lars Marowsky-Bree 2004-07-12 10:28 ` Arjan van de Ven 2004-07-14 8:32 ` Pavel Machek 0 siblings, 2 replies; 55+ messages in thread From: Lars Marowsky-Bree @ 2004-07-12 10:21 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel On 2004-07-12T12:11:07, Arjan van de Ven <arjanv@redhat.com> said: > well the problem is that you cannot prevent a syscall from blocking really. > O_NONBLOCK only impacts the waiting for IO/socket buffer space to not do so > (in general), it doesn't impact the memory allocation strategies by > syscalls. And there's a whopping lot of that in the non-boring syscalls... > So while your heartbeat process won't block during getpid, it'll eventually > need to do real work too .... and I'm quite certain that will lead down to > GFP_KERNEL memory allocations. Sure, but the network IO is isolated from the main process via a _very careful_ non-blocking IO using sockets library, so that works out well. The only scenario which could still impact this severely would be that the kernel did not schedule the soft-rr tasks often enough or all NICs being so overloaded that we can no longer send out the heartbeat packets, and some more silly conditions. In either case I'd venture that said node is so unhealthy that it is quite rightfully evicted from the cluster. A node which is so overloaded should not be starting any new resources whatsoever. However, of course this is more difficult for the case where you are in the write path needed to free some memory; alas, swapping to a GFS mount is probably a realllllly silly idea, too. But again, I'd rather like to see this solved (memory pools for userland, PF_ etc), because it's relevant for many scenarios requiring near-hard-realtime properties, and the answer surely can't be to push it all into the kernel. 
Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 10:21 ` Lars Marowsky-Bree @ 2004-07-12 10:28 ` Arjan van de Ven 2004-07-12 11:50 ` Lars Marowsky-Bree 2004-07-14 8:32 ` Pavel Machek 1 sibling, 1 reply; 55+ messages in thread From: Arjan van de Ven @ 2004-07-12 10:28 UTC (permalink / raw) To: Lars Marowsky-Bree; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1450 bytes --] On Mon, Jul 12, 2004 at 12:21:24PM +0200, Lars Marowsky-Bree wrote: > On 2004-07-12T12:11:07, > Arjan van de Ven <arjanv@redhat.com> said: > > > well the problem is that you cannot prevent a syscall from blocking really. > > O_NONBLOCK only impacts the waiting for IO/socket buffer space to not do so > > (in general), it doesn't impact the memory allocation strategies by > > syscalls. And there's a whopping lot of that in the non-boring syscalls... > > So while your heartbeat process won't block during getpid, it'll eventually > > need to do real work too .... and I'm quite certain that will lead down to > > GFP_KERNEL memory allocations. > > Sure, but the network IO is isolated from the main process via a _very > careful_ non-blocking IO using sockets library, so that works out well. ... which of course never allocates skb's ? ;) > However, of course this is more difficult for the case where you are in > the write path needed to free some memory; alas, swapping to a GFS mount > is probably a realllllly silly idea, too. 
there is more than swap, there's dirty pagecache/mmaps as well > But again, I'd rather like to see this solved (memory pools for > userland, PF_ etc), because it's relevant for many scenarios requiring PF_ is not enough really ;) You need to force GFP_NOFS etc for several critical parts, and well, by being in kernel you can avoid a bunch of these allocations for real, and/or influence their GFP flags [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 10:28 ` Arjan van de Ven @ 2004-07-12 11:50 ` Lars Marowsky-Bree 2004-07-12 12:01 ` Arjan van de Ven 0 siblings, 1 reply; 55+ messages in thread From: Lars Marowsky-Bree @ 2004-07-12 11:50 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel On 2004-07-12T12:28:19, Arjan van de Ven <arjanv@redhat.com> said: > > Sure, but the network IO is isolated from the main process via a _very > > careful_ non-blocking IO using sockets library, so that works out well. > ... which of course never allocates skb's ? ;) No, the interprocess communication does not; it's local sockets. I think Alan (Robertson) even has a paper on this. It's really quite well engineered, with a non-blocking poll() implementation based on signals and stuff. Oh well. > > But again, I'd rather like to see this solved (memory pools for > > userland, PF_ etc), because it's relevant for many scenarios requiring > PF_ is not enough really ;) > You need to force GFP_NOFS etc for several critical parts, and well, by > being in kernel you can avoid a bunch of these allocations for real, and/or > influence their GFP flags True enough, but I'm somewhat unhappy with this still. So whenever we have something like that we need to move it into the kernel space? (pvmove first, and now the clustering etc.) Can't we come up with a way to export this flag to user-space? Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 11:50 ` Lars Marowsky-Bree @ 2004-07-12 12:01 ` Arjan van de Ven 2004-07-12 13:13 ` Lars Marowsky-Bree 0 siblings, 1 reply; 55+ messages in thread From: Arjan van de Ven @ 2004-07-12 12:01 UTC (permalink / raw) To: Lars Marowsky-Bree; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel [-- Attachment #1: Type: text/plain, Size: 480 bytes --] On Mon, Jul 12, 2004 at 01:50:03PM +0200, Lars Marowsky-Bree wrote: > > True enough, but I'm somewhat unhappy with this still. So whenever we > have something like that we need to move it into the kernel space? > (pvmove first, and now the clustering etc.) Can't we come up with a way > to export this flag to user-space? I'm not convinced that's a good idea, in that it exposes what is basically VM internals to userspace, which then would become a set-in-stone interface.... [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 12:01 ` Arjan van de Ven @ 2004-07-12 13:13 ` Lars Marowsky-Bree 2004-07-12 13:40 ` Nick Piggin 0 siblings, 1 reply; 55+ messages in thread From: Lars Marowsky-Bree @ 2004-07-12 13:13 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel On 2004-07-12T14:01:27, Arjan van de Ven <arjanv@redhat.com> said: > I'm not convinced that's a good idea, in that it exposes what is > basically VM internals to userspace, which then would become a > set-in-stone interface.... But I'm also not a big fan of moving all HA-relevant infrastructure into the kernel. Membership and DLM are the first ones; then follows messaging (and reliable and globally ordered messaging is somewhat complex - but if one node is slow, it will hurt global communication too, so...), next someone argues that a node always must be able to report which resources it holds and fence other nodes even under memory pressure, and there goes the cluster resource manager and fencing subsystem into the kernel too etc... Where's the border? And what can we do to make critical user-space infrastructure run reliably and with deterministic-enough & low latency instead of moving it all into the kernel? Yes, the kernel solves these problems right now, but is that really the path we want to head down? Maybe it is, I'm not sure, after all we also have the entire regular network stack in the kernel, but maybe also it is not. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 13:13 ` Lars Marowsky-Bree @ 2004-07-12 13:40 ` Nick Piggin 2004-07-12 20:54 ` Andrew Morton 0 siblings, 1 reply; 55+ messages in thread From: Nick Piggin @ 2004-07-12 13:40 UTC (permalink / raw) To: Lars Marowsky-Bree Cc: Arjan van de Ven, Daniel Phillips, sdake, David Teigland, linux-kernel Lars Marowsky-Bree wrote: > On 2004-07-12T14:01:27, > Arjan van de Ven <arjanv@redhat.com> said: > > >>I'm not convinced that's a good idea, in that it exposes what is >>basically VM internals to userspace, which then would become a >>set-in-stone interface.... > > > But I'm also not a big fan of moving all HA relevant infrastructure into > the kernel. Membership and DLM are the first ones; then follows > messaging (and reliable and globally ordered messaging is somewhat > complex - but if one node is slow, it will hurt global communication > too, so...), next someone argues that a node always must be able to > report which resources it holds and fence other nodes even under memory > pressure, and there goes the cluster resource manager and fencing > subsystem into the kernel too etc... > > Where's the border? > > And what can we do to make critical user-space infrastructure run > reliably and with deterministic-enough & low latency instead of moving > it all into the kernel? > > Yes, the kernel solves these problems right now, but is that really the > path we want to head down? Maybe it is, I'm not sure, afterall we also > have the entire regular network stack in the kernel, but maybe also it > is not. > I don't see why it would be a problem to implement a "this task facilitates page reclaim" flag for userspace tasks that would take care of this as well as the kernel does. There would probably be a few technical things to work out (like GFP_NOFS), but I think it would be pretty trivial to implement. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 13:40 ` Nick Piggin @ 2004-07-12 20:54 ` Andrew Morton 2004-07-13 2:19 ` Daniel Phillips 2004-07-14 12:19 ` Pavel Machek 0 siblings, 2 replies; 55+ messages in thread From: Andrew Morton @ 2004-07-12 20:54 UTC (permalink / raw) To: Nick Piggin; +Cc: lmb, arjanv, phillips, sdake, teigland, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > I don't see why it would be a problem to implement a "this task > facilitates page reclaim" flag for userspace tasks that would take > care of this as well as the kernel does. Yes, that has been done before, and it works - userspace "block drivers" which permanently mark themselves as PF_MEMALLOC to avoid the obvious deadlocks. Note that you can achieve a similar thing in current 2.6 by acquiring realtime scheduling policy, but that's an artifact of some brainwave which a VM hacker happened to have and isn't a thing which should be relied upon. A privileged syscall which allows a task to mark itself as one which cleans memory would make sense. ^ permalink raw reply [flat|nested] 55+ messages in thread
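(An editorial sketch of what Andrew's privileged "memory cleaner" call might look like. Every name here is hypothetical — no such prctl option exists in 2.6 — and the fragment is not compilable on its own; it is shown only to make the idea concrete. `capable()`, `current->flags`, and `PF_MEMALLOC` are the real kernel facilities such a patch would use.)

```c
/* HYPOTHETICAL: PR_SET_MEMCLEAN does not exist in any kernel.
 * Sketch of a prctl() arm letting a privileged task mark itself
 * as one which cleans memory, i.e. hold PF_MEMALLOC from userspace. */
case PR_SET_MEMCLEAN:                 /* hypothetical constant */
        if (!capable(CAP_SYS_ADMIN))  /* privileged, as Andrew says */
                return -EPERM;
        if (arg2)
                current->flags |= PF_MEMALLOC;
        else
                current->flags &= ~PF_MEMALLOC;
        return 0;
```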
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 20:54 ` Andrew Morton @ 2004-07-13 2:19 ` Daniel Phillips 2004-07-13 2:31 ` Nick Piggin 2004-07-14 12:19 ` Pavel Machek 1 sibling, 1 reply; 55+ messages in thread From: Daniel Phillips @ 2004-07-13 2:19 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, lmb, arjanv, sdake, teigland, linux-kernel Hi Andrew, On Monday 12 July 2004 16:54, Andrew Morton wrote: > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > I don't see why it would be a problem to implement a "this task > > facilitates page reclaim" flag for userspace tasks that would take > > care of this as well as the kernel does. > > Yes, that has been done before, and it works - userspace "block drivers" > which permanently mark themselves as PF_MEMALLOC to avoid the obvious > deadlocks. > > Note that you can achieve a similar thing in current 2.6 by acquiring > realtime scheduling policy, but that's an artifact of some brainwave which > a VM hacker happened to have and isn't a thing which should be relied upon. Do you have a pointer to the brainwave? > A privileged syscall which allows a task to mark itself as one which > cleans memory would make sense. For now we can do it with an ioctl, and we pretty much have to do it for pvmove. But that's when user space drives the kernel by syscalls; there is also the nasty (and common) case where the kernel needs userspace to do something for it while it's in PF_MEMALLOC. I'm playing with ideas there, but nothing I'm proud of yet. For now I see the in-kernel approach as the conservative one, for anything that could possibly find itself on the VM writeout path. Unfortunately, that may include some messy things like authentication. I'd really like to solve this reliable-userspace problem. We'd still have lots of arguments left to resolve about where things should be, but at least we'd have the choice. Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-13 2:19 ` Daniel Phillips @ 2004-07-13 2:31 ` Nick Piggin 2004-07-27 3:31 ` Daniel Phillips 0 siblings, 1 reply; 55+ messages in thread From: Nick Piggin @ 2004-07-13 2:31 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrew Morton, lmb, arjanv, sdake, teigland, linux-kernel Daniel Phillips wrote: > Hi Andrew, > > On Monday 12 July 2004 16:54, Andrew Morton wrote: > >>Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> >>>I don't see why it would be a problem to implement a "this task >>>facilitates page reclaim" flag for userspace tasks that would take >>>care of this as well as the kernel does. >> >>Yes, that has been done before, and it works - userspace "block drivers" >>which permanently mark themselves as PF_MEMALLOC to avoid the obvious >>deadlocks. >> >>Note that you can achieve a similar thing in current 2.6 by acquiring >>realtime scheduling policy, but that's an artifact of some brainwave which >>a VM hacker happened to have and isn't a thing which should be relied upon. > > > Do you have a pointer to the brainwave? > Search for rt_task in mm/page_alloc.c > >>A privileged syscall which allows a task to mark itself as one which >>cleans memory would make sense. > > > For now we can do it with an ioctl, and we pretty much have to do it for > pvmove. But that's when user space drives the kernel by syscalls; there is > also the nasty (and common) case where the kernel needs userspace to do > something for it while it's in PF_MEMALLOC. I'm playing with ideas there, > but nothing I'm proud of yet. For now I see the in-kernel approach as the > conservative one, for anything that could possibly find itself on the VM > writeout path. > You'd obviously want to make the PF_MEMALLOC task as tight as possible, and running mlocked: I don't particularly see why such a task would be any safer in-kernel. PF_MEMALLOC tasks won't enter page reclaim at all. 
The only way they will reach the writeout path is if you are write(2)ing stuff (you may hit synch writeout). ^ permalink raw reply [flat|nested] 55+ messages in thread
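(The rt_task check Nick points at has roughly this shape — an editorial paraphrase from memory of 2.6's mm/page_alloc.c, not verbatim source: realtime tasks get a lowered watermark, so they can dip further into the reserve before an allocation fails.)

```c
/* Approximate shape of the check in 2.6's mm/page_alloc.c
 * (paraphrased, not verbatim): the allocation watermark is
 * halved for realtime tasks, letting them eat into the
 * reserve that ordinary tasks must stop short of. */
local_min = z->pages_min;
if (rt_task(p) && !in_interrupt())
        local_min >>= 1;        /* halve the threshold for RT tasks */
min += local_min;
```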
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-13 2:31 ` Nick Piggin @ 2004-07-27 3:31 ` Daniel Phillips 2004-07-27 4:07 ` Nick Piggin 0 siblings, 1 reply; 55+ messages in thread From: Daniel Phillips @ 2004-07-27 3:31 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, lmb, arjanv, sdake, teigland, linux-kernel On Monday 12 July 2004 22:31, Nick Piggin wrote: > Daniel Phillips wrote: > > On Monday 12 July 2004 16:54, Andrew Morton wrote: > >>Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>>I don't see why it would be a problem to implement a "this task > >>>facilitates page reclaim" flag for userspace tasks that would take > >>>care of this as well as the kernel does. > >> > >>Yes, that has been done before, and it works - userspace "block drivers" > >>which permanently mark themselves as PF_MEMALLOC to avoid the obvious > >>deadlocks. > >> > >>Note that you can achieve a similar thing in current 2.6 by acquiring > >>realtime scheduling policy, but that's an artifact of some brainwave > >> which a VM hacker happened to have and isn't a thing which should be > >> relied upon. > > > > Do you have a pointer to the brainwave? > > Search for rt_task in mm/page_alloc.c Ah, interesting idea: realtime tasks get to dip into the PF_MEMALLOC reserve, until it gets down to some threshold, then they have to give up and wait like any other unwashed nobody of a process. _But_ if there's a user space process sitting in the writeout path and some other realtime process eats the entire realtime reserve, everything can still grind to a halt. So it's interesting for realtime, but does not solve the userspace PF_MEMALLOC inversion. > >>A privileged syscall which allows a task to mark itself as one which > >>cleans memory would make sense. > > > > For now we can do it with an ioctl, and we pretty much have to do it for > > pvmove. 
But that's when user space drives the kernel by syscalls; there > > is also the nasty (and common) case where the kernel needs userspace to > > do something for it while it's in PF_MEMALLOC. I'm playing with ideas > > there, but nothing I'm proud of yet. For now I see the in-kernel > > approach as the conservative one, for anything that could possibly find > > itself on the VM writeout path. > > You'd obviously want to make the PF_MEMALLOC task as tight as possible, > and running mlocked: Not just tight, but bounded. And tight too, of course. > I don't particularly see why such a task would be any safer in-kernel. The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or similar IPC to user space. > PF_MEMALLOC tasks won't enter page reclaim at all. The only way they > will reach the writeout path is if you are write(2)ing stuff (you may > hit synch writeout). That's the problem. Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
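(The "inherited down a call chain, not across a pipe" point is visible in the allocator itself. This is an editorial paraphrase of the 2.6 check, not verbatim source: the flag lives on the current task, so every allocation made while it is set sees it, but the process on the far end of a pipe is a different `current` and gains nothing.)

```c
/* Paraphrased from 2.6's mm/page_alloc.c, not verbatim: when the
 * allocating task carries PF_MEMALLOC, the allocation may dip into
 * the emergency reserve.  The test is against the calling task
 * (here `p`), which is why the privilege follows the call chain
 * within one task but never crosses a pipe to another process. */
if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
                && !in_interrupt()) {
        /* allocate from the reserve, ignoring watermarks */
}
```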
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-27 3:31 ` Daniel Phillips @ 2004-07-27 4:07 ` Nick Piggin 2004-07-27 5:57 ` Daniel Phillips 0 siblings, 1 reply; 55+ messages in thread From: Nick Piggin @ 2004-07-27 4:07 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrew Morton, lmb, arjanv, sdake, teigland, linux-kernel Daniel Phillips wrote: >On Monday 12 July 2004 22:31, Nick Piggin wrote: > >> >>Search for rt_task in mm/page_alloc.c >> > >Ah, interesting idea: realtime tasks get to dip into the PF_MEMALLOC reserve, >until it gets down to some threshold, then they have to give up and wait like >any other unwashed nobody of a process. _But_ if there's a user space >process sitting in the writeout path and some other realtime process eats the >entire realtime reserve, everything can still grind to a halt. > >So it's interesting for realtime, but does not solve the userspace PF_MEMALLOC >inversion. > > Not the rt_task thing, because yes, you can have other RT tasks that aren't small and bounded that screw up your reserves. But a PF_MEMALLOC userspace task is still useful. >>>>A privileged syscall which allows a task to mark itself as one which >>>>cleans memory would make sense. >>>> >>>For now we can do it with an ioctl, and we pretty much have to do it for >>>pvmove. But that's when user space drives the kernel by syscalls; there >>>is also the nasty (and common) case where the kernel needs userspace to >>>do something for it while it's in PF_MEMALLOC. I'm playing with ideas >>>there, but nothing I'm proud of yet. For now I see the in-kernel >>>approach as the conservative one, for anything that could possibly find >>>itself on the VM writeout path. >>> >>You'd obviously want to make the PF_MEMALLOC task as tight as possible, >>and running mlocked: >> > >Not just tight, but bounded. And tight too, of course. > > >>I don't particularly see why such a task would be any safer in-kernel. 
>> > >The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or >similar IPC to user space. > > This is no different in kernel of course. You would have to think about which threads need the flag and which do not. Even better, you might acquire and drop the flag only when required. I can't see any obvious problems you would run into. >>PF_MEMALLOC tasks won't enter page reclaim at all. The only way they >>will reach the writeout path is if you are write(2)ing stuff (you may >>hit synch writeout). >> > >That's the problem. > > Well I don't think it would be a problem to get the write throttling path to ignore PF_MEMALLOC tasks if that is what you need. Again, this shouldn't be any different to in-kernel code. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-27 4:07 ` Nick Piggin @ 2004-07-27 5:57 ` Daniel Phillips 0 siblings, 0 replies; 55+ messages in thread From: Daniel Phillips @ 2004-07-27 5:57 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, lmb, arjanv, sdake, teigland, linux-kernel On Tuesday 27 July 2004 00:07, Nick Piggin wrote: > But a PF_MEMALLOC userspace task is still useful. Absolutely. This is the route I'm taking, and I just use an ioctl to flip the task bit as I mentioned (much) earlier. It still needs to be beaten up in practice. The cluster snapshot block device, which has a relatively complex userspace server, should be a nice test case. > >The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or > >similar IPC to user space. > > This is no different in kernel of course. I was talking about in-kernel. Once we let the PF_MEMALLOC state escape to user space, things start looking brighter. But you still have to invoke that userspace code somehow, and there is no direct way to do it, hence PF_MEMALLOC isn't inherited. An easy solution is to have a userspace daemon that's always in PF_MEMALLOC state, as Andrew mentioned, which we can control via a pipe or similar. > You would have to think about > which threads need the flag and which do not. Even better, you might > aquire and drop the flag only when required. Yes, that's what the ioctl is about. However, this doesn't work for servicing writeout. > I can't see any obvious problems you would run into. ;-) Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 20:54 ` Andrew Morton 2004-07-13 2:19 ` Daniel Phillips @ 2004-07-14 12:19 ` Pavel Machek 2004-07-15 2:19 ` Nick Piggin 1 sibling, 1 reply; 55+ messages in thread From: Pavel Machek @ 2004-07-14 12:19 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, lmb, arjanv, phillips, sdake, teigland, linux-kernel Hi! > > I don't see why it would be a problem to implement a "this task > > facilitates page reclaim" flag for userspace tasks that would take > > care of this as well as the kernel does. > > Yes, that has been done before, and it works - userspace "block drivers" > which permanently mark themselves as PF_MEMALLOC to avoid the obvious > deadlocks. > Note that you can achieve a similar thing in current 2.6 by acquiring > realtime scheduling policy, but that's an artifact of some brainwave which > a VM hacker happened to have and isn't a thing which should be relied upon. > > A privileged syscall which allows a task to mark itself as one which > cleans memory would make sense. Does it work? I mean, in kernel, we have some memory cleaners (say 5), and they need, say, 1MB total reserved memory. Now, if you add another task with PF_MEMALLOC. But now you'd need 1.2MB reserved memory, and you only have 1MB. Things are obviously going to break at some point. Pavel -- People were complaining that M$ turns users into beta-testers... ...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl! ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-14 12:19 ` Pavel Machek @ 2004-07-15 2:19 ` Nick Piggin 2004-07-15 12:03 ` Marcelo Tosatti 0 siblings, 1 reply; 55+ messages in thread From: Nick Piggin @ 2004-07-15 2:19 UTC (permalink / raw) To: Pavel Machek Cc: Andrew Morton, lmb, arjanv, phillips, sdake, teigland, linux-kernel Pavel Machek wrote: > Hi! > > >>>I don't see why it would be a problem to implement a "this task >>>facilitates page reclaim" flag for userspace tasks that would take >>>care of this as well as the kernel does. >> >>Yes, that has been done before, and it works - userspace "block drivers" >>which permanently mark themselves as PF_MEMALLOC to avoid the obvious >>deadlocks. > > >>Note that you can achieve a similar thing in current 2.6 by acquiring >>realtime scheduling policy, but that's an artifact of some brainwave which >>a VM hacker happened to have and isn't a thing which should be relied upon. >> >>A privileged syscall which allows a task to mark itself as one which >>cleans memory would make sense. > > > Does it work? > > I mean, in kernel, we have some memory cleaners (say 5), and they > need, say, 1MB total reserved memory. > > Now, if you add another task with PF_MEMALLOC. But now you'd need > 1.2MB reserved memory, and you only have 1MB. Things are obviously > going to break at some point. > Pavel Well you'd have to be more careful than that. In particular you wouldn't just be starting these things up, let alone have them allocate 1MB in to free some memory. This situation would still blow up whether you did it in kernel or not. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-15 2:19 ` Nick Piggin @ 2004-07-15 12:03 ` Marcelo Tosatti 0 siblings, 0 replies; 55+ messages in thread From: Marcelo Tosatti @ 2004-07-15 12:03 UTC (permalink / raw) To: Nick Piggin Cc: Pavel Machek, Andrew Morton, lmb, arjanv, phillips, sdake, teigland, linux-kernel On Thu, Jul 15, 2004 at 12:19:12PM +1000, Nick Piggin wrote: > Pavel Machek wrote: > >Hi! > > > > > >>>I don't see why it would be a problem to implement a "this task > >>>facilitates page reclaim" flag for userspace tasks that would take > >>>care of this as well as the kernel does. > >> > >>Yes, that has been done before, and it works - userspace "block drivers" > >>which permanently mark themselves as PF_MEMALLOC to avoid the obvious > >>deadlocks. Andrew, as curiosity, what userspace "block driver" sets PF_MEMALLOC for normal operation? > >>Note that you can achieve a similar thing in current 2.6 by acquiring > >>realtime scheduling policy, but that's an artifact of some brainwave which > >>a VM hacker happened to have and isn't a thing which should be relied > >>upon. > >> > >>A privileged syscall which allows a task to mark itself as one which > >>cleans memory would make sense. > > > > > >Does it work? > > > >I mean, in kernel, we have some memory cleaners (say 5), and they > >need, say, 1MB total reserved memory. > > > >Now, if you add another task with PF_MEMALLOC. But now you'd need > >1.2MB reserved memory, and you only have 1MB. Things are obviously > >going to break at some point. > > Pavel > > Well you'd have to be more careful than that. In particular > you wouldn't just be starting these things up, let alone > have them allocate 1MB in to free some memory. > > This situation would still blow up whether you did it in > kernel or not. Indeed, such PF_MEMALLOC app can probably kill the system if it bugs allocating lots of memory from the lower reservations. It needs some limitation. 
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 10:21 ` Lars Marowsky-Bree 2004-07-12 10:28 ` Arjan van de Ven @ 2004-07-14 8:32 ` Pavel Machek 1 sibling, 0 replies; 55+ messages in thread From: Pavel Machek @ 2004-07-14 8:32 UTC (permalink / raw) To: Lars Marowsky-Bree Cc: Arjan van de Ven, Daniel Phillips, sdake, David Teigland, linux-kernel Hi! > However, of course this is more difficult for the case where you are in > the write path needed to free some memory; alas, swapping to a GFS mount > is probably a realllllly silly idea, too. Swapping to a GFS mount and writing to it are *very* similar. If swapping to GFS cannot work, it is unlikely write support will be reliable. -- 64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-11 19:44 ` Daniel Phillips 2004-07-11 21:06 ` Lars Marowsky-Bree @ 2004-07-12 4:08 ` Steven Dake 2004-07-12 4:23 ` Daniel Phillips 1 sibling, 1 reply; 55+ messages in thread From: Steven Dake @ 2004-07-12 4:08 UTC (permalink / raw) To: Daniel Phillips Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree On Sun, 2004-07-11 at 12:44, Daniel Phillips wrote: > On Saturday 10 July 2004 19:24, Steven Dake wrote: > > On Sat, 2004-07-10 at 13:57, Daniel Phillips wrote: > > > On Saturday 10 July 2004 13:59, Steven Dake wrote: > > > > overload conditions that have caused the kernel to run low on memory > > > > are a difficult problem, even for kernel components... > > > > ...I hope that helps atleast answer that some r&d is underway to solve > > > > this particular overload problem in userspace. > > > > > > I'm certain there's a solution, but until it is demonstrated and proved, > > > any userspace cluster services must be regarded with narrow squinty > > > eyes. > > > > I agree that a solution must be demonstrated and proved. > > > > There is another option, which I regularly recommend to anyone that > > must deal with memory overload conditions. Don't size the applications > > in such a way as to ever cause memory overload. > > That, and "just add more memory" are the two common mistakes people make when > thinking about this problem. The kernel _normally_ runs near the low-memory > barrier, on the theory that caching as much as possible is a good thing. > Running "near low memory conditions" and running in memory overload which triggers the OOM killer and other bad behaviors are two totally different conditions in the kernel. > Unless you can prove that your userspace approach never deadlocks, the other > questions don't even move the needle. I am sure that one day somebody, maybe > you, will demonstrate a userspace approach that is provably correct. 
Until > then, if you want your cluster to stay up and fail over properly, there's > only one game in town. > As soon as you have proved that cman's cluster protocol cannot be the target of attacks which lead to kernel faults or security faults.. Byzantine failures are a fact of life. There are protocols to minimize these sorts of attacks, but implementing them in the kernel is going to prove very difficult (but possible). One approach is to get them working in userspace correctly, and port them to the kernel. Oom conditions are another fact of life for poorly sized systems. If a cluster is within an OOM condition, it should be removed from the cluster (because it is in overload, under which unknown and generally bad behaviors occur). The openais project does just this: If everything goes to hell in a handbasket on the node running the cluster executive, it will be rejected from the membership. This rejection is implemented with a distributed state machine that ensures, even in low memory conditions, every node (including the failed node) reaches the same conclusions about the current membership and works today in the current code. If at a later time the processor can reenter the membership because it has freed up some memory, it will do so correctly. > We need to worry about ensuring that no API _depends_ on the cluster manager > being in-kernel, and we also need to seek out and excise any parts that could > possibly be moved out to user space without enabling the deadlock or grossly > messing up the kernel code. > > > > I'd invite you, or others interested in these sorts of services, to > > > > contribute that code, if interested. > > > > > > Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see > > > if you can hack it to do what you want. Just write a kernel module > > > that exports the DLM interface to userspace in the desired form. 
> > > http://sources.redhat.com/cluster/dlm/ > > > > I would rather avoid non-mainline kernel dependencies at this time as it > > makes adoption difficult until kernel patches are merged into upstream > > code. Who wants to patch their kernel to try out some APIs? > > Everybody working on clusters. It's a fact of life that you have to apply > patches to run cluster filesystems right now. Production will be a different > story, but (except for the stable GFS code on 2.4) nobody is close to that. > Perhaps people skilled in running pre-alpha software would consider patching a kernel to "give it a run". I have no doubts about that. I would posit a guess that people interested in implementing production clusters are not too interested in applying kernel patches (and causing their kernel to become unsupported) to achieve clustering support any time soon. > > I am doubtful these sort of kernel patches will be merged without a strong > > argument of why it absolutely must be implemented in the kernel vs all > > of the counter arguments against a kernel implementation. > > True. Do you agree that the PF_MEMALLOC argument is a strong one? > Out-of-memory overload is a sucky situation poorly handled by any software, kernel, userland, embedded, whatever. The best solution is to size the applications such that a memory overload doesn't occur. Then if a memory overload condition does occur, that node should at least become suspected of a Byzantine failure condition, which should cause its rejection from the current membership (in the case of a distributed system such as a cluster). > > There is one more advantage to group messaging and distributed locking > > implemented within the kernel, that I hadn't originally considered; it > > sure is sexy. > > I don't think it's sexy, I think it's ugly, to tell the truth.
I am actively > researching how to move the slow-path cluster infrastructure out of kernel, > and I would be pleased to work together with anyone else who is interested in > this nasty problem. > There can be some advantages to group messaging being implemented in the kernel, if it is secure, done correctly (in my view, correctly means implementing the virtual synchrony model) and has low risk of impact to other systems. There are no kernel implemented clustering protocols that come close to these goals today. There are userland implementations under way which will meet these objectives. Perhaps these protocols could be ported to the kernel if group messaging absolutely must be available to kernel components without userland intervention. But I'm still not convinced userland isn't the correct place for these sorts of things. Thanks -steve > Regards, > > Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
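The agreement property described in this message (every node, including a rejected one, reaching the same conclusion about the current membership) follows from delivering membership events in the same total order everywhere, so that every node applies the same deterministic state machine to the same inputs. A toy sketch of that idea, illustrative only and not openais code:

```python
def apply_events(initial, events):
    """Apply an ordered list of ('join'|'leave', node) events to a membership set."""
    view = set(initial)
    for kind, node in events:
        if kind == "join":
            view.add(node)
        elif kind == "leave":
            view.discard(node)
    return view

# The same totally ordered event stream is delivered to every node...
events = [("leave", "node3"), ("join", "node5")]
views = {n: apply_events({"node1", "node2", "node3", "node4"}, events)
         for n in ("node1", "node2", "node3", "node4", "node5")}
# ...so every node, including the rejected node3, computes the identical view.
assert all(v == {"node1", "node2", "node4", "node5"} for v in views.values())
```

The hard part in a real system is of course achieving that total order over a lossy network, which is what the virtual synchrony protocols mentioned above provide.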
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 4:08 ` Steven Dake @ 2004-07-12 4:23 ` Daniel Phillips 2004-07-12 18:21 ` Steven Dake 0 siblings, 1 reply; 55+ messages in thread From: Daniel Phillips @ 2004-07-12 4:23 UTC (permalink / raw) To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree On Monday 12 July 2004 00:08, Steven Dake wrote: > On Sun, 2004-07-11 at 12:44, Daniel Phillips wrote: > OOM conditions are another fact of life for poorly sized systems. If > a node is in an OOM condition, it should be removed from the > cluster (because it is in overload, under which unknown and generally > bad behaviors occur). You missed the point. The memory deadlock I pointed out occurs in _normal operation_. You have to find a way around it, or kernel cluster services win, plain and simple. > The openais project does just this: If everything goes to hell in a > handbasket on the node running the cluster executive, it will be > rejected from the membership. This rejection is implemented with a > distributed state machine that ensures, even in low-memory > conditions, that every node (including the failed node) reaches the same > conclusions about the current membership; this works today in the > current code. If at a later time the processor can reenter the > membership because it has freed up some memory, it will do so > correctly. Think about it. Do you want nodes spontaneously falling over from time to time, even though nothing is wrong with them? What does that do to your 5 nines? > > > I would rather avoid non-mainline kernel dependencies at this > > > time as it makes adoption difficult until kernel patches are > > > merged into upstream code. Who wants to patch their kernel to > > > try out some APIs? > > > > Everybody working on clusters. It's a fact of life that you have > > to apply patches to run cluster filesystems right now. 
Production > > will be a different story, but (except for the stable GFS code on > > 2.4) nobody is close to that. > > Perhaps people skilled in running pre-alpha software would consider > patching a kernel to "give it a run". I have no doubts about that. > > I would posit that people interested in implementing production > clusters are not too interested in applying kernel patches (and > causing their kernel to become unsupported) to achieve clustering > support any time soon. We are _far_ from production, at least on 2.6. At this point, we are only interested in people who like to code, test, tinker, and be the first kid on the block with a shiny new storage cluster in their rec room. And by "we" I mean "you, me, and everybody else who hopes that Linux will kick butt in clusters, in the 2.8 time frame." > > > I am doubtful these sort of kernel patches will be merged without > > > a strong argument of why it absolutely must be implemented in the > > > kernel vs all of the counter arguments against a kernel > > > implementation. > > > > True. Do you agree that the PF_MEMALLOC argument is a strong one? > > Out-of-memory overload is a sucky situation poorly handled by any > software, kernel, userland, embedded, whatever. In case you missed it above, please let me point out one more time that I am not talking about OOM. I'm talking about a deadlock that may come up even when resource usage is well within limits, which is inherent in the basic design of Linux. There is nothing Byzantine about it. Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
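The deadlock Daniel keeps pointing at is circular: writeout needs the cluster service, the cluster service needs memory, and memory can only be freed by writeout. The kernel breaks such cycles with dedicated reserves (PF_MEMALLOC allocations, mempools) that are set aside before pressure hits. A minimal userland-style sketch of that remedy, hypothetical code rather than anything from the thread:

```python
class MessageReserve:
    """Preallocated pool of fixed-size message buffers.

    Analogous in spirit to a kernel mempool: the memory is committed up
    front, so the messaging path never calls the allocator while the
    system is trying to free memory through it.
    """
    def __init__(self, count, size):
        self.free = [bytearray(size) for _ in range(count)]

    def get(self):
        # Fails fast if the reserve is exhausted instead of blocking in
        # the allocator and completing the deadlock cycle.
        return self.free.pop() if self.free else None

    def put(self, buf):
        self.free.append(buf)

reserve = MessageReserve(count=4, size=512)
buf = reserve.get()
assert buf is not None and len(buf) == 512
reserve.put(buf)
```

The design question the thread is really arguing about is whether such guarantees can be made to hold from user space at all, since a userland process can be paged or starved in ways a PF_MEMALLOC-marked kernel thread cannot.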
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 4:23 ` Daniel Phillips @ 2004-07-12 18:21 ` Steven Dake 2004-07-12 19:54 ` Daniel Phillips 2004-07-13 20:06 ` Pavel Machek 0 siblings, 2 replies; 55+ messages in thread From: Steven Dake @ 2004-07-12 18:21 UTC (permalink / raw) To: Daniel Phillips Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree On Sun, 2004-07-11 at 21:23, Daniel Phillips wrote: > On Monday 12 July 2004 00:08, Steven Dake wrote: > > On Sun, 2004-07-11 at 12:44, Daniel Phillips wrote: > > OOM conditions are another fact of life for poorly sized systems. If > > a node is in an OOM condition, it should be removed from the > > cluster (because it is in overload, under which unknown and generally > > bad behaviors occur). > > You missed the point. The memory deadlock I pointed out occurs in > _normal operation_. You have to find a way around it, or kernel > cluster services win, plain and simple. > The bottom line is that we just don't know if any such deadlock occurs, under normal operations. The remaining objections to in-kernel cluster services give us a lot of reasons to test out a userland approach. I propose, once a distributed lock service is implemented in user space, adding support for it to the gfs and remaining Red Hat storage cluster services trees. This will give us real data on performance and reliability that we can't get by guessing. Thanks -steve > > current code. If at a later time the processor can reenter the > > membership because it has freed up some memory, it will do so > > correctly. > > Think about it. Do you want nodes spontaneously falling over from time > to time, even though nothing is wrong with them? What does that do to > your 5 nines? > > > > > I would rather avoid non-mainline kernel dependencies at this > > > > time as it makes adoption difficult until kernel patches are > > > > merged into upstream code. Who wants to patch their kernel to > > > > try out some APIs? 
> > > > > > Everybody working on clusters. It's a fact of life that you have > > > to apply patches to run cluster filesystems right now. Production > > > will be a different story, but (except for the stable GFS code on > > > 2.4) nobody is close to that. > > > > Perhaps people skilled in running pre-alpha software would consider > > patching a kernel to "give it a run". I have no doubts about that. > > > > I would posit a guess people interested in implementing production > > clusters are not too interested about applying kernel patches (and > > causing their kernel to become unsupported) to achieve clustering > > support any time soon. > > We are _far_ from production, at least on 2.6. At this point, we are > only interested in people who like to code, test, tinker, and be the > first kid on the block with a shiny new storage cluster in their rec > room. And by "we" I mean "you, me, and everybody else who hopes that > Linux will kick butt in clusters, in the 2.8 time frame." > > > > > I am doubtful these sort of kernel patches will be merged without > > > > a strong argument of why it absolutely must be implemented in the > > > > kernel vs all of the counter arguments against a kernel > > > > implementation. > > > > > > True. Do you agree that the PF_MEMALLOC argument is a strong one? > > > > out of memory overload is a sucky situation poorly handled by any > > software, kernel, userland, embedded, whatever. > > In case you missed it above, please let me point out one more time that > I am not talking about OOM. I'm talking about a deadlock that may come > up even when a resource usage is well within limits, which is inherent > in the basic design of Linux. There is nothing Byzantine about it. > > Regards, > > Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
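For readers unfamiliar with the VMS-style DLM both sides keep referring to: it defines six lock modes with a fixed compatibility matrix, and a request is grantable only if its mode is compatible with every lock already held on the resource. The table below is the standard one from the VMS DLM design (NL = null, CR = concurrent read, CW = concurrent write, PR = protected read, PW = protected write, EX = exclusive); it is a sketch for illustration, not the Red Hat DLM's actual source.

```python
# Standard VMS DLM lock-mode compatibility: COMPAT[held] is the set of
# requested modes that may be granted alongside a lock held in that mode.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}

def grantable(requested, held_modes):
    """A request is grantable iff it is compatible with every held lock."""
    return all(requested in COMPAT[h] for h in held_modes)

assert grantable("PR", ["PR", "CR"])   # shared readers coexist
assert not grantable("EX", ["PR"])     # exclusive must wait for readers
```

Whether this grant logic lives in the kernel or behind a userspace daemon is exactly the point of contention in the thread; the matrix itself is the same either way.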
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 18:21 ` Steven Dake @ 2004-07-12 19:54 ` Daniel Phillips 2004-07-13 20:06 ` Pavel Machek 1 sibling, 0 replies; 55+ messages in thread From: Daniel Phillips @ 2004-07-12 19:54 UTC (permalink / raw) To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree On Monday 12 July 2004 14:21, Steven Dake wrote: > On Sun, 2004-07-11 at 21:23, Daniel Phillips wrote: > > On Monday 12 July 2004 00:08, Steven Dake wrote: > > > On Sun, 2004-07-11 at 12:44, Daniel Phillips wrote: > > > Oom conditions are another fact of life for poorly sized systems. > > > If a cluster is within an OOM condition, it should be removed > > > from the cluster (because it is in overload, under which unknown > > > and generally bad behaviors occur). > > > > You missed the point. The memory deadlock I pointed out occurs in > > _normal operation_. You have to find a way around it, or kernel > > cluster services win, plain and simple. > > The bottom line is that we just don't know if any such deadlock > occurs, under normal operations. I thought I demonstrated that, should I restate? You need to point out the flaw in my argument (about the deadlock, not about philosophy). If/when you succeed, I will be pleased. Until you do succeed, there's a deadlock. Regards, Daniel ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-12 18:21 ` Steven Dake 2004-07-12 19:54 ` Daniel Phillips @ 2004-07-13 20:06 ` Pavel Machek 1 sibling, 0 replies; 55+ messages in thread From: Pavel Machek @ 2004-07-13 20:06 UTC (permalink / raw) To: Steven Dake Cc: Daniel Phillips, Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree Hi! > > You missed the point. The memory deadlock I pointed out occurs in > > _normal operation_. You have to find a way around it, or kernel > > cluster services win, plain and simple. > > > > The bottom line is that we just don't know if any such deadlock occurs, > under normal operations. The remaining objections to in-kernel cluster I did some work on swapping-over-nbd, which has similar issues, and yes, the deadlocks were seen under heavy load. *Designing* something with "let's hope it does not deadlock", while a deadlock clearly can be triggered, looks like a bad idea. -- 64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-08 10:53 ` David Teigland 2004-07-08 14:14 ` Chris Friesen 2004-07-08 18:22 ` Daniel Phillips @ 2004-07-12 10:14 ` Lars Marowsky-Bree 2 siblings, 0 replies; 55+ messages in thread From: Lars Marowsky-Bree @ 2004-07-12 10:14 UTC (permalink / raw) To: David Teigland, linux-kernel; +Cc: Daniel Phillips On 2004-07-08T18:53:38, David Teigland <teigland@redhat.com> said: > I'm afraid the fencing issue has been rather misrepresented. Here's > what we're doing (a lot of background is necessary I'm afraid.) We > have a symmetric, kernel-based, stand-alone cluster manager (CMAN) > that has no ties to anything else whatsoever. It'll simply run and > answer the question "who's in the cluster?" by providing a list of > names/nodeids. Excuse my ignorance, but does this ensure that there's consensus among the nodes about this membership? > has quorum. It's a very standard way of doing things -- we modelled it > directly off the VMS-cluster style. Whether you care about this quorum value > or what you do with it are beside the point. OK, I agree with this. As long as the CMAN itself doesn't care about this either but just reports it to the cluster, that's fine. > What about Fencing? Fencing is not a part of the cluster manager, not > a part of the dlm and not a part of gfs. It's an entirely independent > system that runs on its own in userland. It depends on cman for > cluster information just like the dlm or gfs does. I'll repeat what I > said on the linux-cluster mailing list: I doubt it can be entirely independent; or how do you implement lock recovery without a fencing mechanism? > This fencing system is suitable for us in our gfs/clvm work. It's > probably suitable for others, too. For everyone? No. 
It sounds useful enough even for our work, given appropriate notification of fencing events; instead of scheduling a fencing event, we'd need to make sure that the node joins a fencing domain and later block until receiving a notification. It's not as fine-grained, but our approach (based on the dependencies of the resources managed, basically) might have been more fine-grained than required in a typical environment. Yes, I can see how that could be made to work. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs, Research and Development | try again. fail again. fail better. SUSE LINUX AG - A Novell company \ -- Samuel Beckett ^ permalink raw reply [flat|nested] 55+ messages in thread
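The "VMS-cluster style" quorum that CMAN reports, as discussed in the exchange above, is simple majority arithmetic over the cluster's expected votes: the partition holding more than half of the votes is quorate and may fence the rest. A hedged sketch of that arithmetic (function name hypothetical):

```python
def has_quorum(votes_present, expected_votes):
    """Quorate iff the present votes exceed half of the expected total."""
    return votes_present >= expected_votes // 2 + 1

assert has_quorum(3, 5)       # 3 of 5 votes: quorate majority
assert not has_quorum(2, 5)   # 2 of 5 votes: partitioned minority
assert not has_quorum(2, 4)   # even split: neither side is quorate
```

Lars's caveat applies: whether the cluster acts on this number is policy, and keeping that policy out of the membership service itself is what makes the reported quorum value harmless.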
[parent not found: <fa.io9lp90.1c02foo@ifi.uio.no>]
[parent not found: <fa.go9f063.1i72joh@ifi.uio.no>]
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 [not found] ` <fa.go9f063.1i72joh@ifi.uio.no> @ 2004-07-06 6:39 ` Aneesh Kumar K.V 0 siblings, 0 replies; 55+ messages in thread From: Aneesh Kumar K.V @ 2004-07-06 6:39 UTC (permalink / raw) To: Chris Friesen; +Cc: Daniel Phillips, Christoph Hellwig, linux-kernel Chris Friesen wrote: > Daniel Phillips wrote: > >> Don't you think we ought to take a look at how OCFS and GFS might share >> some of the same infrastructure, for example, the DLM and cluster >> membership services? > > > For cluster membership, you might consider looking at the OpenAIS CLM > portion. It would be nice if this type of thing was unified across more > than just filesystems. > > How about looking Cluster Infrastructure ( http://ci-linux.sf.net ) and OpenSSI ( http://www.openssi.org ) for cluster membership service. -aneesh ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
@ 2004-07-10 14:58 James Bottomley
2004-07-10 16:04 ` David Teigland
0 siblings, 1 reply; 55+ messages in thread
From: James Bottomley @ 2004-07-10 14:58 UTC (permalink / raw)
To: David Teigland; +Cc: Linux Kernel
> gfs needs to run in the kernel. dlm should run in the kernel since gfs uses it
> so heavily. cman is the clustering subsystem on top of which both of those are
> built and on which both depend quite critically. It simply makes most sense to
> put cman in the kernel for what we're doing with it. That's not a dogmatic
> position, just a practical one based on our experience.
This isn't really acceptable. We've spent a long time throwing things
out of the kernel so you really need a good justification for putting
things in again. "it makes sense" and "it's just practical" aren't
sufficient.
You also face two other additional hurdles:
1) GFS today uses a user space DLM. What critical problems does this
have that you suddenly need to move it all into the kernel?
2) We have numerous other clustering products for Linux, none of which
(well except the Veritas one) has any requirement at all on having
pieces in the kernel. If all the others operate in user space, why does
yours need to be in the kernel?
So do you have a justification for requiring these as kernel components?
James
^ permalink raw reply [flat|nested] 55+ messages in thread* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-10 14:58 James Bottomley @ 2004-07-10 16:04 ` David Teigland 2004-07-10 16:26 ` James Bottomley 0 siblings, 1 reply; 55+ messages in thread From: David Teigland @ 2004-07-10 16:04 UTC (permalink / raw) To: linux-kernel; +Cc: James Bottomley On Sat, Jul 10, 2004 at 09:58:02AM -0500, James Bottomley wrote: > gfs needs to run in the kernel. dlm should run in the kernel since gfs > uses it so heavily. cman is the clustering subsystem on top of which > both of those are built and on which both depend quite critically. It > simply makes most sense to put cman in the kernel for what we're doing > with it. That's not a dogmatic position, just a practical one based on > our experience. > > This isn't really acceptable. We've spent a long time throwing things out of > the kernel so you really need a good justification for putting things in > again. "it makes sense" and "its just practical" aren't sufficient. The "it" refers to gfs. This means gfs doesn't make a lot of sense and isn't very practical without it. I'm not the one to speculate on what gfs would become otherwise, others would do that better. > You also face two other additional hurdles: > > 1) GFS today uses a user space DLM. What critical problems does this have > that you suddenly need to move it all into the kernel? GFS does not use a user space dlm today. GFS uses the client-server gulm lock manager for which the client (gfs) side runs in the kernel and the gulm server runs in userspace on some other node. People have naturally been averse to using servers like this with gfs for a long time and we've finally created the serverless dlm (a la VMS clusters). For many people this is the only option that makes gfs interesting; it's also what the opengfs group was doing. This is a revealing discussion. 
We've worked hard to make gfs's lock manager independent from gfs itself so it could be useful to others and make gfs less monolithic. We could have left it embedded within the file system itself -- that's what most other cluster file systems do. If we'd done that we would have avoided this objection altogether but with an inferior design. The fact that there's an independent lock manager to point at and question illustrates our success. The same goes for the cluster manager. (We could, of course, do some simple glueing together and make a monolithic system again :-) > 2) We have numerous other clustering products for Linux, none of which (well > except the Veritas one) has any requirement at all on having pieces in the > kernel. If all the others operate in user space, why does yours need to be > in the kernel? If you want gfs in user space you don't want gfs; you want something different. -- Dave Teigland <teigland@redhat.com> ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 2004-07-10 16:04 ` David Teigland @ 2004-07-10 16:26 ` James Bottomley 0 siblings, 0 replies; 55+ messages in thread From: James Bottomley @ 2004-07-10 16:26 UTC (permalink / raw) To: David Teigland; +Cc: Linux Kernel On Sat, 2004-07-10 at 11:04, David Teigland wrote: > The "it" refers to gfs. This means gfs doesn't make a lot of sense and isn't > very practical without it. I'm not the one to speculate on what gfs would > become otherwise, others would do that better. This is what you actually said: > It simply makes most sense to put cman in the kernel for > what we're doing with it. I interpret that to mean you think cman (your cluster manager) should be in the kernel. Is this incorrect? > > > You also face two other additional hurdles: > > > > 1) GFS today uses a user space DLM. What critical problems does this have > > that you suddenly need to move it all into the kernel? > > GFS does not use a user space dlm today. GFS uses the client-server gulm lock > manager for which the client (gfs) side runs in the kernel and the gulm server > runs in userspace on some other node. People have naturally been averse to > using servers like this with gfs for a long time and we've finally created the > serverless dlm (a la VMS clusters). For many people this is the only option > that makes gfs interesting; it's also what the opengfs group was doing. OK, whatever you choose to call it, the previous lock manager used by gfs was userspace. OK, so why is a kernel based DLM the only option that makes GFS interesting? What are the concrete advantages you achieve with a kernel based DLM that you don't get with a user space one? There are plenty of symmetric serverless userspace DLM implementations that follow the old VMS (and even updated by Oracle) spec. 
Steve Dake has already given a pretty compelling list of why you shouldn't put the DLM and clustering in the kernel; what is the more compelling list of reasons why it should be? > This is a revealing discussion. We've worked hard to make gfs's lock manager > independent from gfs itself so it could be useful to others and make gfs less > monolithic. We could have left it embedded within the file system itself -- > that's what most other cluster file systems do. If we'd done that we would > have avoided this objection altogether but with an inferior design. The fact > that there's an independent lock manager to point at and question illustrates > our success. The same goes for the cluster manager. (We could, of course, do > some simple glueing together and make a monolithic system again :-) I'm not questioning your goal, merely your in-kernel implementation. Sharing is good. However, things which are shared don't automatically have to be in-kernel. > > 2) We have numerous other clustering products for Linux, none of which (well > > except the Veritas one) has any requirement at all on having pieces in the > > kernel. If all the others operate in user space, why does yours need to be > > in the kernel? > > If you want gfs in user space you don't want gfs; you want something different. I didn't say GFS, I said "cluster products". That's the DLM and CMAN pieces of your architecture. Once you can convince us that CMAN et al should be in the kernel, the next stage of the discussion would be the API. Several groups (like GGL, SAF and OCF) have done API work for clusters. They were mostly careful to select APIs that avoided mandating cluster policy. You seem to have chosen a particular policy (voting quorate) to implement. Again, that's a red flag. Policy should not be in the kernel; if we all agree there should be in-kernel APIs for clustering then they should be sufficiently abstracted to support all current cluster policies. 
James ^ permalink raw reply [flat|nested] 55+ messages in thread
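James's closing point, that a shared membership API should report facts while leaving quorum policy pluggable, can be sketched as a separation between a view of the membership and interchangeable policy functions over it (all names here are hypothetical, not any group's actual API):

```python
def membership_view():
    # Facts only: who is in the cluster and how many votes each node carries.
    return {"node1": 1, "node2": 1, "node3": 1}

def majority_policy(view, expected_votes):
    # One possible policy: simple majority of expected votes.
    return sum(view.values()) > expected_votes / 2

def always_quorate_policy(view, expected_votes):
    # Another: e.g. a two-node cluster arbitrated by an external tiebreaker.
    return True

view = membership_view()
assert majority_policy(view, 5)        # 3 of 5 votes: this policy says quorate
assert always_quorate_policy(view, 5)  # policy swapped; the view API unchanged
```

The design point is that only `membership_view` would need to live wherever the cluster manager lives; every policy above it is replaceable without touching that interface.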
end of thread, other threads:[~2004-07-27 5:56 UTC | newest]
Thread overview: 55+ messages
2004-07-05 6:09 [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 Daniel Phillips
2004-07-05 15:09 ` Christoph Hellwig
2004-07-05 18:42 ` Daniel Phillips
2004-07-05 19:08 ` Chris Friesen
2004-07-05 20:29 ` Daniel Phillips
2004-07-07 22:55 ` Steven Dake
2004-07-08 1:30 ` Daniel Phillips
2004-07-05 19:12 ` Lars Marowsky-Bree
2004-07-05 20:27 ` Daniel Phillips
2004-07-06 7:34 ` Lars Marowsky-Bree
2004-07-06 21:34 ` Daniel Phillips
2004-07-07 18:16 ` Lars Marowsky-Bree
2004-07-08 1:14 ` Daniel Phillips
2004-07-08 9:10 ` Lars Marowsky-Bree
2004-07-08 10:53 ` David Teigland
2004-07-08 14:14 ` Chris Friesen
2004-07-08 16:06 ` David Teigland
2004-07-08 18:22 ` Daniel Phillips
2004-07-08 19:41 ` Steven Dake
2004-07-10 4:58 ` David Teigland
2004-07-10 4:58 ` Daniel Phillips
2004-07-10 17:59 ` Steven Dake
2004-07-10 20:57 ` Daniel Phillips
2004-07-10 23:24 ` Steven Dake
2004-07-11 19:44 ` Daniel Phillips
2004-07-11 21:06 ` Lars Marowsky-Bree
2004-07-12 6:58 ` Arjan van de Ven
2004-07-12 10:05 ` Lars Marowsky-Bree
2004-07-12 10:11 ` Arjan van de Ven
2004-07-12 10:21 ` Lars Marowsky-Bree
2004-07-12 10:28 ` Arjan van de Ven
2004-07-12 11:50 ` Lars Marowsky-Bree
2004-07-12 12:01 ` Arjan van de Ven
2004-07-12 13:13 ` Lars Marowsky-Bree
2004-07-12 13:40 ` Nick Piggin
2004-07-12 20:54 ` Andrew Morton
2004-07-13 2:19 ` Daniel Phillips
2004-07-13 2:31 ` Nick Piggin
2004-07-27 3:31 ` Daniel Phillips
2004-07-27 4:07 ` Nick Piggin
2004-07-27 5:57 ` Daniel Phillips
2004-07-14 12:19 ` Pavel Machek
2004-07-15 2:19 ` Nick Piggin
2004-07-15 12:03 ` Marcelo Tosatti
2004-07-14 8:32 ` Pavel Machek
2004-07-12 4:08 ` Steven Dake
2004-07-12 4:23 ` Daniel Phillips
2004-07-12 18:21 ` Steven Dake
2004-07-12 19:54 ` Daniel Phillips
2004-07-13 20:06 ` Pavel Machek
2004-07-12 10:14 ` Lars Marowsky-Bree
[not found] <fa.io9lp90.1c02foo@ifi.uio.no>
[not found] ` <fa.go9f063.1i72joh@ifi.uio.no>
2004-07-06 6:39 ` Aneesh Kumar K.V
-- strict thread matches above, loose matches on Subject: below --
2004-07-10 14:58 James Bottomley
2004-07-10 16:04 ` David Teigland
2004-07-10 16:26 ` James Bottomley