public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
@ 2004-07-05  6:09 Daniel Phillips
  2004-07-05 15:09 ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-05  6:09 UTC (permalink / raw)
  To: linux-kernel

Red Hat and (the former) Sistina Software are pleased to announce that 
we will host a two day kickoff workshop on GFS and Cluster 
Infrastructure in Minneapolis, July 29 and 30, not too long after OLS.  
We call this the "Cluster Summit" because it goes well beyond GFS, and 
is really about building a comprehensive cluster infrastructure for 
Linux, which will hopefully be a reality by the time Linux 2.8 arrives.  
If we want that, we have to start now, and we have to work like fiends, 
time is short.  We offer as a starting point, functional code for a 
half-dozen major, generic cluster subsystems that Sistina has had under 
development for several years.

This means not just a cluster filesystem, but cluster logical volume 
management, generic distributed locking, cluster membership services, 
node fencing, user space utilities, graphical interfaces and more.  Of 
course, it's all up for peer review.  Everybody is invited, and yes, 
that includes OCFS and Lustre folks too. Speaking as an honorary 
OpenGFS team member, we will be there in force.

Tentative agenda items:

   - GFS walkthrough: let's get hacking
   - GULM, the Grand Unified Lock Manager
   - Sistina's brand new Distributed Lock Manager
   - Symmetric Cluster Architecture walkthrough
   - Are we there yet?  Infrastructure directions
   - GFS: Great, it works!  What next?

Further details, including information on travel and hotel arrangements, 
will be posted over the next few days on the Red Hat sponsored 
community cluster page:

   http://sources.redhat.com/cluster/

Unfortunately, space is limited.  We feel we can accommodate about fifty 
people comfortably.  Registration is first come, first served.  The 
price is: Free!  (Of course.)  If you're interested, please email me.

Let's set our sights on making Linux 2.8 a true cluster operating 
system.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-05  6:09 [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 Daniel Phillips
@ 2004-07-05 15:09 ` Christoph Hellwig
  2004-07-05 18:42   ` Daniel Phillips
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2004-07-05 15:09 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Mon, Jul 05, 2004 at 02:09:29AM -0400, Daniel Phillips wrote:
> Red Hat and (the former) Sistina Software are pleased to announce that 
> we will host a two day kickoff workshop on GFS and Cluster 
> Infrastructure in Minneapolis, July 29 and 30, not too long after OLS.  
> We call this the "Cluster Summit" because it goes well beyond GFS, and 
> is really about building a comprehensive cluster infrastructure for 
> Linux, which will hopefully be a reality by the time Linux 2.8 arrives.  
> If we want that, we have to start now, and we have to work like fiends, 
> time is short.  We offer as a starting point, functional code for a 
> half-dozen major, generic cluster subsystems that Sistina has had under 
> development for several years.

Don't you think it's a little too short-term?  I'd rather see the cluster
software that could be merged mid-term on KS (and that seems to be only OCFS2
so far)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-05 15:09 ` Christoph Hellwig
@ 2004-07-05 18:42   ` Daniel Phillips
  2004-07-05 19:08     ` Chris Friesen
  2004-07-05 19:12     ` Lars Marowsky-Bree
  0 siblings, 2 replies; 55+ messages in thread
From: Daniel Phillips @ 2004-07-05 18:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel

Hi Christoph,

On Monday 05 July 2004 11:09, Christoph Hellwig wrote:
> On Mon, Jul 05, 2004 at 02:09:29AM -0400, Daniel Phillips wrote:
> > Red Hat and (the former) Sistina Software are pleased to announce
> > that we will host a two day kickoff workshop on GFS and Cluster
> > Infrastructure in Minneapolis, July 29 and 30, not too long after
> > OLS. We call this the "Cluster Summit" because it goes well beyond
> > GFS, and is really about building a comprehensive cluster
> > infrastructure for Linux, which will hopefully be a reality by the
> > time Linux 2.8 arrives. If we want that, we have to start now, and
> > we have to work like fiends, time is short.  We offer as a starting
> > point, functional code for a half-dozen major, generic cluster
> > subsystems that Sistina has had under development for several
> > years.
>
> Don't you think it's a little too short-term?

Not really.  It's several months later than it should have been if 
anything.

> I'd rather see the 
> cluster software that could be merged mid-term on KS (and that seems
> to be only OCFS2 so far)

Don't you think we ought to take a look at how OCFS and GFS might share 
some of the same infrastructure, for example, the DLM and cluster 
membership services?

"Think twice, merge once"

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-05 18:42   ` Daniel Phillips
@ 2004-07-05 19:08     ` Chris Friesen
  2004-07-05 20:29       ` Daniel Phillips
  2004-07-05 19:12     ` Lars Marowsky-Bree
  1 sibling, 1 reply; 55+ messages in thread
From: Chris Friesen @ 2004-07-05 19:08 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Christoph Hellwig, linux-kernel

Daniel Phillips wrote:

> Don't you think we ought to take a look at how OCFS and GFS might share
> some of the same infrastructure, for example, the DLM and cluster
> membership services?

For cluster membership, you might consider looking at the OpenAIS CLM portion. 
It would be nice if this type of thing was unified across more than just 
filesystems.

Chris

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-05 18:42   ` Daniel Phillips
  2004-07-05 19:08     ` Chris Friesen
@ 2004-07-05 19:12     ` Lars Marowsky-Bree
  2004-07-05 20:27       ` Daniel Phillips
  1 sibling, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-05 19:12 UTC (permalink / raw)
  To: Daniel Phillips, Christoph Hellwig; +Cc: linux-kernel

On 2004-07-05T14:42:27,
   Daniel Phillips <phillips@redhat.com> said:

> Don't you think we ought to take a look at how OCFS and GFS might share 
> some of the same infrastructure, for example, the DLM and cluster 
> membership services?

Indeed. If your efforts in joining the infrastructure are more
successful than ours have been, more power to you ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-05 19:12     ` Lars Marowsky-Bree
@ 2004-07-05 20:27       ` Daniel Phillips
  2004-07-06  7:34         ` Lars Marowsky-Bree
  0 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-05 20:27 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Christoph Hellwig, linux-kernel

Hi Lars,

On Monday 05 July 2004 15:12, Lars Marowsky-Bree wrote:
> On 2004-07-05T14:42:27,
>
>    Daniel Phillips <phillips@redhat.com> said:
> > Don't you think we ought to take a look at how OCFS and GFS might
> > share some of the same infrastructure, for example, the DLM and
> > cluster membership services?
>
> Indeed. If your efforts in joining the infrastructure are more
> successful than ours have been, more power to you ;-)

What problems did you run into?

On a quick read-through, it seems quite straightforward for quorum, 
membership and distributed locking.

The idea of having more than one node fencing system running at the same 
time seems deeply scary, we'd better make some effort to come up with 
something common.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-05 19:08     ` Chris Friesen
@ 2004-07-05 20:29       ` Daniel Phillips
  2004-07-07 22:55         ` Steven Dake
  0 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-05 20:29 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Christoph Hellwig, linux-kernel

On Monday 05 July 2004 15:08, Chris Friesen wrote:
> Daniel Phillips wrote:
> > Don't you think we ought to take a look at how OCFS and GFS might
> > share some of the same infrastructure, for example, the DLM and
> > cluster membership services?
>
> For cluster membership, you might consider looking at the OpenAIS CLM
> portion.  It would be nice if this type of thing was unified across 
> more than just filesystems.

My own project is a block driver, that's not a filesystem, right?  
Cluster membership services as implemented by Sistina are generic, 
symmetric and (hopefully) raceless.  See:

  http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/preslan/preslan.pdf

There is much overlap between the OpenAIS and Sistina's Symmetric 
Cluster Architecture.  You are right, we do need to get together.

By the way, how do I get your source code if I don't agree with the 
BitKeeper license?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
       [not found] ` <fa.go9f063.1i72joh@ifi.uio.no>
@ 2004-07-06  6:39   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 55+ messages in thread
From: Aneesh Kumar K.V @ 2004-07-06  6:39 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Daniel Phillips, Christoph Hellwig, linux-kernel

Chris Friesen wrote:
> Daniel Phillips wrote:
> 
>> Don't you think we ought to take a look at how OCFS and GFS might share
>> some of the same infrastructure, for example, the DLM and cluster
>> membership services?
> 
> 
> For cluster membership, you might consider looking at the OpenAIS CLM 
> portion. It would be nice if this type of thing was unified across more 
> than just filesystems.
> 
> 

How about  looking Cluster Infrastructure ( http://ci-linux.sf.net ) and 
OpenSSI ( http://www.openssi.org ) for cluster membership service.

-aneesh

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-05 20:27       ` Daniel Phillips
@ 2004-07-06  7:34         ` Lars Marowsky-Bree
  2004-07-06 21:34           ` Daniel Phillips
  0 siblings, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-06  7:34 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On 2004-07-05T16:27:51,
   Daniel Phillips <phillips@redhat.com> said:

> > Indeed. If your efforts in joining the infrastructure are more
> > successful than ours have been, more power to you ;-)
> 
> What problems did you run into?

The problems were mostly political. Maybe we tried to push too early,
but 1-3 years back, people weren't really interested in agreeing on some
common components or APIs. In particular a certain Linux vendor didn't
even join the group ;-) And the "industry" was very reluctant too. Which
meant that everybody spend ages talking and not much happening.

However, times may have changed, and hopefully for the better. The push
to get one solution included into the Linux kernel may be enough to
convince people that this time its for real...

There still is the Open Clustering Framework group though, which is a
sub-group of the FSG and maybe the right umbrella to put this under, to
stay away from the impression that it's a single vendor pushing.

If we could revive that and make real progress, I'd be as happy as a
well fed penguin.

Now with OpenAIS on the table, the GFS stack, the work already done by
OCF in the past (which is, admittedly, depressingly little, but I quite
like the Resource Agent API for one) et cetera, there may be a good
chance.

I'll try to get travel approval to go to the meeting. 

BTW, is the mailing list working? I tried subscribing when you first
announced it, but the subscription request hasn't been approved yet...
Maybe I shouldn't have subscribed with the suse.de address ;-)

> On a quick read-through, it seems quite straightforward for quorum, 
> membership and distributed locking.

Believe me, you'd be amazed to find out how long you can argue on how to
identify a node alone - node name, node number (sparse or continuous?),
UUID...? ;-)

And, how do you define quorum, and is it always needed? Some algorithms
don't need quorum (ie, election algorithms can do fine without), so a
membership service which only works with quorum isn't the right
component etc...

> The idea of having more than one node fencing system running at the same 
> time seems deeply scary, we'd better make some effort to come up with 
> something common.

Yes. This is actually an important point, and fencing policies are also
reasonably complex. The GFS stack seems to tie fencing quite deeply into
the system (which is understandable, since you always have shared
storage, otherwise a node wouldn't be part of the GFS domain in the
first place).

However, the new dependency based cluster resource manager we are
writing right now (which we simply call "Cluster Resource Manager" for
lack of creativity ;) decides whether or not it needs to fence a node
based on the resources in the cluster - if it isn't affecting the
resources we can run on the remaining nodes, or none of the resources
requires node-level fencing, no such operation will be done. 

This has advantages in larger clusters (where, if split, each partition
could still continue to run resources which are unaffected by the split
even the other nodes cannot be fenced), in shared nothing clusters or
resources which are self-fencing and do not need STONITH etc.

The ties between membership, quorum and fencing are not as strong in
these scenarios, at least not mandatory. So a stack which enforced
fencing at these levels, and w/o coordinating with the CRM first, would
not work out.

And by pushing for inclusion into the main kernel, you'll also raise all
sleeping zom^Wbeauties. I hope you have a long breath for the
discussions ;-)

There's lots of work there.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-06  7:34         ` Lars Marowsky-Bree
@ 2004-07-06 21:34           ` Daniel Phillips
  2004-07-07 18:16             ` Lars Marowsky-Bree
  0 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-06 21:34 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: linux-kernel, Lon Hohberger

Hi Lars,

On Tuesday 06 July 2004 03:34, Lars Marowsky-Bree wrote:
> On 2004-07-05T16:27:51,
>
>    Daniel Phillips <phillips@redhat.com> said:
> > > Indeed. If your efforts in joining the infrastructure are more
> > > successful than ours have been, more power to you ;-)
> >
> > What problems did you run into?
>
> The problems were mostly political. Maybe we tried to push too early,
> but 1-3 years back, people weren't really interested in agreeing on
> some common components or APIs. In particular a certain Linux vendor
> didn't even join the group ;-)

*blush*

> And the "industry" was very reluctant 
> too. Which meant that everybody spend ages talking and not much
> happening.

We're showing up with loads of Sistina code this time.  It's up to 
everybody else to ante up, and yes, I see there's more code out there.  
It's going to be quite a summer reading project.

> However, times may have changed, and hopefully for the better. The
> push to get one solution included into the Linux kernel may be enough
> to convince people that this time its for real...

It's for real, no question.  There are at least two viable GPL code 
bases already, GFS and Lustre, with OCFS2 coming up fast.  And there 
are several commercial (binary/evil) cluster filesystems in service 
already, not that Linus should care about them, but they do lend 
credibility.

> There still is the Open Clustering Framework group though, which is a
> sub-group of the FSG and maybe the right umbrella to put this under,
> to stay away from the impression that it's a single vendor pushing.

Oops, another code base to read ;-)

> If we could revive that and make real progress, I'd be as happy as a
> well fed penguin.

Red Hat is solidly behind this as a _community_ effort.

> Now with OpenAIS on the table, the GFS stack, the work already done
> by OCF in the past (which is, admittedly, depressingly little, but I
> quite like the Resource Agent API for one) et cetera, there may be a
> good chance.
>
> I'll try to get travel approval to go to the meeting.

:-)

> BTW, is the mailing list working? I tried subscribing when you first
> announced it, but the subscription request hasn't been approved
> yet... Maybe I shouldn't have subscribed with the suse.de address ;-)

Perhaps it has more to do with a cross-channel grudge? <grin>

Just poke Alasdair, you know where to find him.

> > On a quick read-through, it seems quite straightforward for quorum,
> > membership and distributed locking.
>
> Believe me, you'd be amazed to find out how long you can argue on how
> to identify a node alone - node name, node number (sparse or
> continuous?), UUID...? ;-)

I can believe it.  What I have just done with my cluster snapshot target 
over the last couple of weeks is, removed _every_ dependency on cluster 
infrastructure and moved the one remaining essential interface to user 
space.  In this way the infrastructure becomes pluggable from the 
cluster block device's point of view and you can run the target without 
any cluster infrastructure at all if you want (just dmsetup and a 
utility for connecting a socket to the target).  This is a general 
technique that we're now applying to a second block driver.  It's a 
tiny amount of kernel and userspace code which I will post pretty soon.  
With this refactoring, the cluster block driver shrank to less than 
half its former size with no loss of functionality.

The nice thing is, I get to use the existing (SCA) infrastructure, but I 
don't have any dependency on it.

> And, how do you define quorum, and is it always needed? Some
> algorithms don't need quorum (ie, election algorithms can do fine
> without), so a membership service which only works with quorum isn't
> the right component etc...

Oddly enough, there has been much discussion about quorum here as well.  
This must be pluggable, and we must be able to handle multiple, 
independent clusters, with a single node potentially belonging to more 
than one at the same time.  Please see this, for a formal writeup on 
our 2.6 code base:

   http://people.redhat.com/~teigland/sca.pdf

Is this the key to the grand, unified quorum system that will do every 
job perfectly?  Good question, however I do know how to make it 
pluggable for my own component, at essentially zero cost.  This makes 
me optimistic that we can work out something sensible, and that perhaps 
it's already a solved problem.

It looks like fencing is more of an issue, because having several node 
fencing systems running at the same time in ignorance of each other is 
deeply wrong.  We can't just wave our hands at this by making it 
pluggable, we need to settle on one that works and use it.  I'll humbly 
suggest that Sistina is furthest along in this regard.

> > The idea of having more than one node fencing system running at the
> > same time seems deeply scary, we'd better make some effort to come
> > up with something common.
>
> Yes. This is actually an important point, and fencing policies are
> also reasonably complex. The GFS stack seems to tie fencing quite
> deeply into the system (which is understandable, since you always
> have shared storage, otherwise a node wouldn't be part of the GFS
> domain in the first place).

Oops, should have read ahead ;)  The DLM is also tied deeply into the 
GFS stack, but that factors out nicely, and in fact, GFS can currently 
use two completely different fencing systems (GULM vs SCA-Fence).  I 
think we can sort this out.

> However, the new dependency based cluster resource manager we are
> writing right now (which we simply call "Cluster Resource Manager"
> for lack of creativity ;) decides whether or not it needs to fence a
> node based on the resources in the cluster - if it isn't affecting
> the resources we can run on the remaining nodes, or none of the
> resources requires node-level fencing, no such operation will be
> done.

Cluster resource management is the least advanced of the components that 
our Red Hat Sistina group has to offer, mainly because it is seen as a 
matter of policy, and so the pressing need at this state is to provide 
suitable hooks.  Lon Hohberger is working on system that works with the 
SCA framework (Magma).  The preexisting Red Hat cluster team decided to 
re-roll their whole cluster suite within the new framework.  Perhaps 
you would like to take a look, and tell us why this couldn't possibly 
work for you?  (Or maybe we need to get you drunk first...)

> This has advantages in larger clusters (where, if split, each
> partition could still continue to run resources which are unaffected
> by the split even the other nodes cannot be fenced), in shared
> nothing clusters or resources which are self-fencing and do not need
> STONITH etc.

"STOMITH" :)  Yes, exactly.  Global load balancing is another big item, 
i.e., which node gets assigned the job of running a particular service, 
which means you need to know how much of each of several different 
kinds of resources a particular service requires, and what the current 
resource usage profile is for each node on the cluster.  Rik van Riel 
is taking a run at this.

It's a huge, scary problem.  We _must_ be able to plug in different 
solutions, all the way from completely manual to completely automagic, 
and we have to be able to handle more than one at once.

> The ties between membership, quorum and fencing are not as strong in
> these scenarios, at least not mandatory. So a stack which enforced
> fencing at these levels, and w/o coordinating with the CRM first,
> would not work out.

Yes, again, fencing looks like the one we have to fret about.  The 
others will be a lot easier to mix and match.

> And by pushing for inclusion into the main kernel, you'll also raise
> all sleeping zom^Wbeauties. I hope you have a long breath for the
> discussions ;-)

You know I do!

> There's lots of work there.

Indeed, and I didn't do any work today yet, due to answering email.

Incidently, there is already a nice crosssection of the cluster 
community on the way to sunny Minneapolis for the July meeting.  We've 
reached about 50% capacity, and we have quorum, I think :-)

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-06 21:34           ` Daniel Phillips
@ 2004-07-07 18:16             ` Lars Marowsky-Bree
  2004-07-08  1:14               ` Daniel Phillips
  0 siblings, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-07 18:16 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Lon Hohberger

On 2004-07-06T17:34:51,
   Daniel Phillips <phillips@redhat.com> said:

> > And the "industry" was very reluctant 
> > too. Which meant that everybody spend ages talking and not much
> > happening.
> We're showing up with loads of Sistina code this time.  It's up to 
> everybody else to ante up, and yes, I see there's more code out there.  
> It's going to be quite a summer reading project.

Yeah, I wish you the best. There's always been quite a bit of code to
show, but that alone didn't convince people ;-) I've certainly grown a
bit more experienced / cynical during that time. (Which, according to
Oscar Wilde, is the same anyway ;)

> It's for real, no question.  There are at least two viable GPL code 
> bases already, GFS and Lustre, with OCFS2 coming up fast.

Yes, they have some common requirements on the kernel VFS layer, though
Lustre certainly has the most extensive demands. I hope someone from CFS
Inc can make it to your summit.

> I can believe it.  What I have just done with my cluster snapshot target 
> over the last couple of weeks is, removed _every_ dependency on cluster 
> infrastructure and moved the one remaining essential interface to user 
> space.

Is there a KS presentation on this? I didn't get invited to KS and will
just be allowed in for OLS, but I'll be around town already...

> Oddly enough, there has been much discussion about quorum here as well.  
> This must be pluggable, and we must be able to handle multiple, 
> independent clusters, with a single node potentially belonging to more 
> than one at the same time.  Please see this, for a formal writeup on 
> our 2.6 code base:
> 
>    http://people.redhat.com/~teigland/sca.pdf

Thanks for the pointer, this is a good read.

> It looks like fencing is more of an issue, because having several node 
> fencing systems running at the same time in ignorance of each other is 
> deeply wrong.  We can't just wave our hands at this by making it 
> pluggable, we need to settle on one that works and use it.  I'll humbly 
> suggest that Sistina is furthest along in this regard.

Your fencing system is fine with me; based on the assumption that you
always have to fence a failed node, you are doing the right thing.
However, the issues are more subtle when this is no longer true, and in
a 1:1 how do you arbitate who is allowed to fence?

> Cluster resource management is the least advanced of the components that 
> our Red Hat Sistina group has to offer, mainly because it is seen as a 
> matter of policy, and so the pressing need at this state is to provide 
> suitable hooks.

> "STOMITH" :)  Yes, exactly.  Global load balancing is another big item, 
> i.e., which node gets assigned the job of running a particular service, 
> which means you need to know how much of each of several different 
> kinds of resources a particular service requires, and what the current 
> resource usage profile is for each node on the cluster.  Rik van Riel 
> is taking a run at this.

Right, cluster resource management is one of the things where I'm quite
happy with the approach the new heartbeat resource manager is heading
down (or up, I hope ;).

> It's a huge, scary problem.  We _must_ be able to plug in different 
> solutions, all the way from completely manual to completely automagic, 
> and we have to be able to handle more than one at once.

You can plug multiple ones as long as they are managing independent
resources, obviously. However, if the CRM is the one which ultimately
decides whether a node needs to be fenced or not - based on its
knowledge of which resources it owns or could own - this gets a lot more
scary still...

> Yes, again, fencing looks like the one we have to fret about.  The 
> others will be a lot easier to mix and match.

Mostly, yes. Unless you (like some) require quorum to report a cluster
membership, which some implementations do.

> Incidently, there is already a nice crosssection of the cluster 
> community on the way to sunny Minneapolis for the July meeting.  We've 
> reached about 50% capacity, and we have quorum, I think :-)

Uhm, do I have to be frightened of being fenced? ;)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-05 20:29       ` Daniel Phillips
@ 2004-07-07 22:55         ` Steven Dake
  2004-07-08  1:30           ` Daniel Phillips
  0 siblings, 1 reply; 55+ messages in thread
From: Steven Dake @ 2004-07-07 22:55 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Chris Friesen, Christoph Hellwig, linux-kernel

On Mon, 2004-07-05 at 13:29, Daniel Phillips wrote:
> On Monday 05 July 2004 15:08, Chris Friesen wrote:
> > Daniel Phillips wrote:
> > > Don't you think we ought to take a look at how OCFS and GFS might
> > > share some of the same infrastructure, for example, the DLM and
> > > cluster membership services?
> >
> > For cluster membership, you might consider looking at the OpenAIS CLM
> > portion.  It would be nice if this type of thing was unified across 
> > more than just filesystems.
> 
> My own project is a block driver, that's not a filesystem, right?  
> Cluster membership services as implemented by Sistina are generic, 
> symmetric and (hopefully) raceless.  See:
> 
>   http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/preslan/preslan.pdf
> 
> There is much overlap between the OpenAIS and Sistina's Symmetric 
> Cluster Architecture.  You are right, we do need to get together.
> 
> By the way, how do I get your source code if I don't agree with the 
> BitKeeper license?
> 

Daniel

If you mean how do you get source code to the openais project without
bk, it is available as a nightly tarball download from
developer.osdl.org:

http://developer.osdl.org/cherry/openais

If you want to contribute to openais, you can still contribute by using
diff by sending patches to:

openais@lists.osdl.org

Regards
-steve

> Regards,
> 
> Daniel
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-07 18:16             ` Lars Marowsky-Bree
@ 2004-07-08  1:14               ` Daniel Phillips
  2004-07-08  9:10                 ` Lars Marowsky-Bree
  0 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-08  1:14 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: linux-kernel, Lon Hohberger, David Teigland

On Wednesday 07 July 2004 14:16, Lars Marowsky-Bree wrote:
> On 2004-07-06T17:34:51, Daniel Phillips <phillips@redhat.com> said:
> > > And the "industry" was very reluctant
> > > too. Which meant that everybody spend ages talking and not much
> > > happening.
> >
> > We're showing up with loads of Sistina code this time.  It's up to
> > everybody else to ante up, and yes, I see there's more code out
> > there. It's going to be quite a summer reading project.
>
> Yeah, I wish you the best. There's always been quite a bit of code to
> show, but that alone didn't convince people ;-) I've certainly grown
> a bit more experienced / cynical during that time. (Which, according
> to Oscar Wilde, is the same anyway ;)

OK, what I've learned from the discussion so far is, we need to avoid 
getting stuck too much on the HA aspects and focus more on the 
cluster/performance side for now.  There are just too many entrenched 
positions on failover.  Even though every component of the cluster is 
designed to fail over, that's just a small part of what we have to deal 
with:

  - Cluster Volume management
  - Cluster configuration management
  - Cluster membership/quorum
  - Node Fencing
  - Parallel cluster filesystems with local semantics
  - Distributed Locking
  - Cluster mirror block device
  - Cluster snapshot block device
  - Cluster administration interface, including volume managment
  - Cluster resource balancing
  - bits I forgot to mention

Out of that, we need to pick the three or four items we're prepared to 
address immediately, that we can obviously share between at least two 
known cluster filesystems, and get them onto lkml for peer review.  
Trying to push the whole thing as one lump has never worked for 
anybody, and won't work in this case either.  For example, the DLM is 
fairly non-controversial, and important in terms of performance and 
reliability.  Let's start with that.

Furthermore, nobody seems interested in arguing about the cluster block 
devices either, so lets just discuss how they work and get them out of 
the way.

Then let's tackle the low level infrastructure, such as CCS (Cluster 
Configuration System) that does a simple job, that is, it distributes 
configuration files racelessly.

I heard plenty of fascinating discussion of quorum strategies last 
night, and have a number of papers to read as a result.  But that's a 
diversion: it can and must be pluggable.  We just need to agree on how 
the plugs work, a considerably less ambitious task.

In general, the principle is: the less important it is, the more 
argument there will be about it.  Defer that, make it pluggable, call 
it policy, push it to user space, and move on.  We need to agree on the 
basics so that we can manage network volumes with cluster filesystems 
on top of them.

> > I can believe it.  What I have just done with my cluster snapshot
> > target over the last couple of weeks is, removed _every_ dependency
> > on cluster infrastructure and moved the one remaining essential
> > interface to user space.
>
> Is there a KS presentation on this? I didn't get invited to KS and
> will just be allowed in for OLS, but I'll be around town already...

There will be a BOF at OLS, "Cluster Infrastructure".  Since I didn't 
get a KS invite either and what remains is more properly lkml stuff 
anyway, I will go canoing with Matt O'Keefe during KS as planned.  We 
already did the necessary VFS fixups over the last year (save the 
non-critical flock patch, which is now in play) so there is nothing 
much left to beg Linus for.  There are additional VFS hooks that would 
be nice to have for optimization, but they can wait, people will 
appreciate them more that way ;)

The non-vfs cluster infrastructure just uses the normal module API, 
except for a couple of places in the DM cluster block devices where 
I've allowed myself some creative license, easily undone.  Again, this 
is lkml material, not KS stuff.

> > It looks like fencing is more of an issue, because having several
> > node fencing systems running at the same time in ignorance of each
> > other is deeply wrong.  We can't just wave our hands at this by
> > making it pluggable, we need to settle on one that works and use
> > it.  I'll humbly suggest that Sistina is furthest along in this
> > regard.
>
> Your fencing system is fine with me; based on the assumption that you
> always have to fence a failed node, you are doing the right thing.
> However, the issues are more subtle when this is no longer true, and
> in a 1:1 how do you arbitate who is allowed to fence?

Good question.  Since two-node clusters are my primary interest at the 
moment, I need some answers.  I think the current plan is: they try to 
fence each other, winner take all.  Each node will introspect to decide 
if it's in good enough shape to do the job itself, then go try to fence 
the other one.  Alternatively, they can be configured so that one has 
more votes than the other, if somebody wants that broken arrangement.  

This is my dim recollection, I'll have more to say when I've actually 
hooked my stuff up to it.  There are others with plenty of experience 
in this, see below.

> > Cluster resource management is the least advanced of the components
> > that our Red Hat Sistina group has to offer, mainly because it is
> > seen as a matter of policy, and so the pressing need at this state
> > is to provide suitable hooks.
> >
> > "STOMITH" :)  Yes, exactly.  Global load balancing is another big
> > item, i.e., which node gets assigned the job of running a
> > particular service, which means you need to know how much of each
> > of several different kinds of resources a particular service
> > requires, and what the current resource usage profile is for each
> > node on the cluster.  Rik van Riel is taking a run at this.
>
> Right, cluster resource management is one of the things where I'm
> quite happy with the approach the new heartbeat resource manager is
> heading down (or up, I hope ;).

Combining heartbeat and resource management sounds like a good idea.  
Currently, we have them separate and since I have not tried it myself 
yet, I'll reserve comment.  Dave Teigland would be more than happy to 
wax poetic, though.

> > It's a huge, scary problem.  We _must_ be able to plug in different
> > solutions, all the way from completely manual to completely
> > automagic, and we have to be able to handle more than one at once.
>
> You can plug multiple ones as long as they are managing independent
> resources, obviously. However, if the CRM is the one which ultimately
> decides whether a node needs to be fenced or not - based on its
> knowledge of which resources it owns or could own - this gets a lot
> more scary still...

We do not see the CRM as being involved in fencing at present, though I 
can see why perhaps it ought to be.  The resource manager that Lon 
Hohberger is cooking up is scriptable and rule-driven.  I'm sure we 
could spend 100% of the available time on that alone.  My strategy is, 
I send my manually-configurable cluster bits to Lon and he hooks them 
in so everything is automagic, then I look at how much the end result 
sucks/doesn't suck.

There's some philosophy at work here: I feel that any cluster device 
that requires elaborate infrastructure and configuration to run is 
broken.  If you can set the cluster devices up manually and they depend 
only on existing kernel interfaces, they're more likely to get unit 
testing.  At the same time, these devices have to fit well into a 
complex infrastructure, therefore the manual interface can be driven 
equally well by a script or C program, and there is one tiny but 
crucial additional hook to allow for automatic reconnection to the 
cluster if something bad happens, or if the resource manager just feels 
the need to reorganize things.

So while I'm rambling here, I'll mention that the resource manager (or 
anybody else) can just summarily cut the block target's pipe and the 
block target will politely go ask for a new one.  No IOs will be 
failed, nothing will break, no suspend needed, just one big breaker 
switch to throw.  This of course depends on the target using a pipe 
(socket) to communicate with the cluster, but even if I do switch to 
UDP, I'll still keep at least one pipe around, just because it makes 
the target so easy to control.

It didn't start this way.  The first prototype had a couple thousand 
lines of glue code to work with various possible infrastructures.  Now 
that's all gone and there are just two pipes left, one to local user 
space for cluster management and the other to somewhere out on the 
cluster for synchronization.  It's now down to 30% of the original size 
and runs faster as a bonus.  All cluster interfaces are "read/write", 
except for one ioctl to reconnect a broken pipe.

> > Incidently, there is already a nice crosssection of the cluster
> > community on the way to sunny Minneapolis for the July meeting. 
> > We've reached about 50% capacity, and we have quorum, I think :-)
>
> Uhm, do I have to be frightened of being fenced? ;)

Only if you drink too much of that kluster Koolaid

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-07 22:55         ` Steven Dake
@ 2004-07-08  1:30           ` Daniel Phillips
  0 siblings, 0 replies; 55+ messages in thread
From: Daniel Phillips @ 2004-07-08  1:30 UTC (permalink / raw)
  To: sdake; +Cc: Chris Friesen, Christoph Hellwig, linux-kernel

On Wednesday 07 July 2004 18:55, Steven Dake wrote:
> On Mon, 2004-07-05 at 13:29, Daniel Phillips wrote:
> > On Monday 05 July 2004 15:08, Chris Friesen wrote:
> > > For cluster membership, you might consider looking at the OpenAIS
> > > CLM portion.  It would be nice if this type of thing was unified
> > > across more than just filesystems.
> >
> > My own project is a block driver, that's not a filesystem, right?
> > Cluster membership services as implemented by Sistina are generic,
> > symmetric and (hopefully) raceless.  See:
> >  
> > http://www.usenix.org/publications/library/proceedings/als00/2000pa
> >pers/papers/full_papers/preslan/preslan.pdf

Whoops, I just noticed that that link is way wrong, I must have been 
asleep when I posted it.  This is the correct one:

   http://people.redhat.com/~teigland/sca.pdf

and

   http://sources.redhat.com/cluster/cman/

Not that the other isn't interesting, it's just a little dated and 
GFS-specific.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08  1:14               ` Daniel Phillips
@ 2004-07-08  9:10                 ` Lars Marowsky-Bree
  2004-07-08 10:53                   ` David Teigland
  0 siblings, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-08  9:10 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Lon Hohberger, David Teigland

On 2004-07-07T21:14:07,
   Daniel Phillips <phillips@redhat.com> said:

> OK, what I've learned from the discussion so far is, we need to avoid 
> getting stuck too much on the HA aspects and focus more on the 
> cluster/performance side for now.  There are just too many entrenched 
> positions on failover.

Well, first, failover is not all of HA. But that's a different diversion
again.

> Out of that, we need to pick the three or four items we're prepared to 
> address immediately, that we can obviously share between at least two 
> known cluster filesystems, and get them onto lkml for peer review.  

Ok.

> For example, the DLM is fairly non-controversial, and important in
> terms of performance and reliability.  Let's start with that.

I doubt that assessment, the DLM is going to be somewhat controversial
already and requires the dragging in of membership, inter-node
messaging, fencing and quorum. The problem is that you cannot easily
separate out the different pieces.

I'd humbly suggest to start with the changes in the VFS layers which the
CFS's of the different kinds require, regardless of which infrastructure
they use.

Of all the cluster-subsystems, the fencing system is likely the most
important. If the various implementations don't step on eachothers toes
there, the duplication of membership/messaging/etc is only inefficient,
but not actively harmful.

> I heard plenty of fascinating discussion of quorum strategies last 
> night, and have a number of papers to read as a result.  But that's a 
> diversion: it can and must be pluggable.  We just need to agree on how 
> the plugs work, a considerably less ambitious task.

When you argue whether or not you can mandate quorum for a given cluster
implementation, and which layers of the cluster are allowed to require
quorum (some will refuse to even tell you the membership without quorum;
some will require quorum before they fence, others will recover quorum
by fencing), this discussion is fairly complex.

Again, let's see what kernel hooks these require, and defer all the rest
of the discussions as far as possible.

> it policy, push it to user space, and move on.  We need to agree on the 
> basics so that we can manage network volumes with cluster filesystems 
> on top of them.

Ah, that in itself is a very data-centric point of view and not exactly
applicable to the needs of shared-nothing clusters. (I'm not trying to
nitpick, just trying to make you aware of all the hidden assumptions you
may not be aware of yourself.) Of course, this is perfectly fine for
something such as GFS (which, being SAN based, of course requires
these), but a cluster infrastructure in the kernel may not be limitted
to this.

> > Is there a KS presentation on this? I didn't get invited to KS and
> > will just be allowed in for OLS, but I'll be around town already...
> There will be a BOF at OLS, "Cluster Infrastructure".  Since I didn't 
> get a KS invite either and what remains is more properly lkml stuff 
> anyway, I will go canoing with Matt O'Keefe during KS as planned. 

Ah, okay.

> > Your fencing system is fine with me; based on the assumption that you
> > always have to fence a failed node, you are doing the right thing.
> > However, the issues are more subtle when this is no longer true, and
> > in a 1:1 how do you arbitate who is allowed to fence?
> Good question.  Since two-node clusters are my primary interest at the 
> moment, I need some answers. 

Two-node clusters are reasonably easy, true.

> I think the current plan is: they try to fence each other, winner take
> all.  Each node will introspect to decide if it's in good enough shape
> to do the job itself, then go try to fence the other one.

Ok, this is essentially what heartbeat does, but it gets more complex
with >2 nodes. In which case your cluster block device is going to run
into interesting synchronization issues, too, I'd venture. (Or at least
drbd does, where we look at replicating across >2 nodes.)

> > resources, obviously. However, if the CRM is the one which ultimately
> > decides whether a node needs to be fenced or not - based on its
> > knowledge of which resources it owns or could own - this gets a lot
> > more scary still...
> We do not see the CRM as being involved in fencing at present, though I 
> can see why perhaps it ought to be.  The resource manager that Lon 
> Hohberger is cooking up is scriptable and rule-driven. 

Frankly, I'm kind of disappointed; why are you cooking up your own once
more? When we set out to write a new dependency-based flexible resource
manager, we explicitly made it clear that it wasn't just meant to run on
top of heartbeat, but in theory on top of any cluster infrastructure.

I know this is the course of Open Source development, and that
"community project" basically means "my wheel be better than your wheel,
and you are allowed to get behind it after we are done, but don't
interfere before that", but I'd have expected some discussions or at
least solicitation of them on the established public mailing lists, just
to keep up the pretense ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08  9:10                 ` Lars Marowsky-Bree
@ 2004-07-08 10:53                   ` David Teigland
  2004-07-08 14:14                     ` Chris Friesen
                                       ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: David Teigland @ 2004-07-08 10:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: Daniel Phillips, Lars Marowsky-Bree

On Thu, Jul 08, 2004 at 11:10:43AM +0200, Lars Marowsky-Bree wrote:

> Of all the cluster-subsystems, the fencing system is likely the most
> important. If the various implementations don't step on eachothers toes
> there, the duplication of membership/messaging/etc is only inefficient,
> but not actively harmful.

I'm afraid the fencing issue has been rather misrepresented.  Here's what we're
doing (a lot of background is necessary I'm afraid.)  We have a symmetric,
kernel-based, stand-alone cluster manager (CMAN) that has no ties to anything
else whatsoever.  It'll simply run and answer the question "who's in the
cluster?" by providing a list of names/nodeids.

So, if that's all you want you can just run cman on all your nodes and it'll
tell you who's in the cluster (kernel and userland api's).  CMAN will also do
generic callbacks to tell you when the membership has changed.  Some people can
stop reading here.

In the event of network partitions you can obviously have two cman clusters
form independently (i.e. "split-brain").  Some people care about this.  Quorum
is a trivial true/false property of the cluster.  Every cluster member has a
number of votes and the cluster itself has a number of expected votes.  Using
these simple values, cman does a quick computation to tell you if the cluster
has quorum.  It's a very standard way of doing things -- we modelled it
directly off the VMS-cluster style.  Whether you care about this quorum value
or what you do with it are beside the point.  Some may be interested in
discussing how cman works and participating in further development; if so go
ahead and ask on linux-cluster@redhat.com.  We've been developing and using
cman for 3-4 years.  Are there other valid approaches? of course.  Is cman
suitable for many people? yes.  Suitable for everyone? no.

(see http://sources.redhat.com/cluster/ for patches and mailing list)

What about the DLM?  The DLM we've developed is again modelled exactly after
that in VMS-clusters.  It depends on cman for the necessary clustering input.
Note that it uses the same generic cman api's as any other system.  Again, the
DLM is utterly symmetric; there is no server or master node involved.  Is this
DLM suitable for many people? yes.  For everyone? no.  (Right now gfs and clvm
are the primary dlm users simply because those are the other projects our group
works on.  DLM is in no way specific to either of those.) 

What about Fencing?  Fencing is not a part of the cluster manager, not a part
of the dlm and not a part of gfs.  It's an entirely independent system that
runs on its own in userland.  It depends on cman for cluster information just
like the dlm or gfs does.  I'll repeat what I said on the linux-cluster mailing
list:

--
Fencing is a service that runs on its own in a CMAN cluster; it's entirely
independent from other services.  GFS simply checks to verify fencing is
running before allowing a mount since it's especially dangerous for a mount to
succeed without it.

As soon as a node joins a fencing domain it will be fenced by another domain
member if it fails.  i.e. as soon as a node runs:

> cman_tool join    (joins the cluster)
> fence_tool join   (starts fenced which joins the default fence domain)

it will be fenced by another fence domain member if it fails.  So, you simply
need to configure your nodes to run fence_tool join after joining the cluster
if you want fencing to happen.  You can add any checks later on that you think
are necessary to be sure that the node is in the fence domain.

Running fence_tool leave will remove a node cleanly from the fence domain (it
won't be fenced by other members.)
--

This fencing system is suitable for us in our gfs/clvm work.  It's probably
suitable for others, too.  For everyone? no.  Can be improved with further
development? yes.  A central or difficult issue? not really.  Again, no need to
look at the dlm or gfs or clvm to work with this fencing system.

-- 
Dave Teigland  <teigland@redhat.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08 10:53                   ` David Teigland
@ 2004-07-08 14:14                     ` Chris Friesen
  2004-07-08 16:06                       ` David Teigland
  2004-07-08 18:22                     ` Daniel Phillips
  2004-07-12 10:14                     ` Lars Marowsky-Bree
  2 siblings, 1 reply; 55+ messages in thread
From: Chris Friesen @ 2004-07-08 14:14 UTC (permalink / raw)
  To: David Teigland; +Cc: linux-kernel, Daniel Phillips, Lars Marowsky-Bree

David Teigland wrote:

> I'm afraid the fencing issue has been rather misrepresented.  Here's 
> what we're
> doing (a lot of background is necessary I'm afraid.)  We have a symmetric,
> kernel-based, stand-alone cluster manager (CMAN) that has no ties to 
> anything
> else whatsoever.  It'll simply run and answer the question "who's in the
> cluster?" by providing a list of names/nodeids.
> 
> So, if that's all you want you can just run cman on all your nodes and 
> it'll
> tell you who's in the cluster (kernel and userland api's).  CMAN will 
> also do
> generic callbacks to tell you when the membership has changed.  Some 
> people can
> stop reading here.

I'm curious--this seems to be exactly what the cluster membership portion of the 
SAF spec provides.  Would it make sense to contribute to that portion of 
OpenAIS, then export the CMAN API on top of it for backwards compatibility?

It just seems like there are a bunch of different cluster messaging, membership, 
etc. systems, and there is a lot of work being done in parallel with different 
implementations of the same functionality.  Now that there is a standard 
emerging for clustering (good or bad, we've got people asking for it) would it 
make sense to try and get behind that standard and try and make a reference 
implementation?

You guys are more experienced than I, but it seems a bit of a waste to see all 
these projects re-inventing the wheel.

Chris

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08 14:14                     ` Chris Friesen
@ 2004-07-08 16:06                       ` David Teigland
  0 siblings, 0 replies; 55+ messages in thread
From: David Teigland @ 2004-07-08 16:06 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, Daniel Phillips, Lars Marowsky-Bree


> >I'm afraid the fencing issue has been rather misrepresented.  Here's what
> >we're doing (a lot of background is necessary I'm afraid.)  We have a
> >symmetric, kernel-based, stand-alone cluster manager (CMAN) that has no ties
> >to anything else whatsoever.  It'll simply run and answer the question
> >"who's in the cluster?" by providing a list of names/nodeids.
> >
> >So, if that's all you want you can just run cman on all your nodes and it'll
> >tell you who's in the cluster (kernel and userland api's).  CMAN will also
> >do generic callbacks to tell you when the membership has changed.  Some
> >people can stop reading here.
> 
> I'm curious--this seems to be exactly what the cluster membership portion of
> the SAF spec provides.  Would it make sense to contribute to that portion of
> OpenAIS, then export the CMAN API on top of it for backwards compatibility?

That's definately worth investigating.  If the SAF API is only of interest in
userland, then perhaps a library can translate between the SAF api and the
existing interface cman exports to userland.  We'd welcome efforts to make cman
itself more compatible with SAF, too.  We're not very familiar with it, though.


> It just seems like there are a bunch of different cluster messaging,
> membership, etc. systems, and there is a lot of work being done in parallel
> with different implementations of the same functionality.  Now that there is
> a standard emerging for clustering (good or bad, we've got people asking for
> it) would it make sense to try and get behind that standard and try and make
> a reference implementation?
> 
> You guys are more experienced than I, but it seems a bit of a waste to see
> all these projects re-inventing the wheel.

Sure, we're happy to help make this code more useful to others.  We wrote this
for a very immediate and practical reason of course -- to support gfs, clvm,
dlm, etc, but always expected it would be used more broadly.  We've not done a
lot of work with it lately since as I mentioned it was begun years ago.

-- 
Dave Teigland  <teigland@redhat.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08 10:53                   ` David Teigland
  2004-07-08 14:14                     ` Chris Friesen
@ 2004-07-08 18:22                     ` Daniel Phillips
  2004-07-08 19:41                       ` Steven Dake
  2004-07-12 10:14                     ` Lars Marowsky-Bree
  2 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-08 18:22 UTC (permalink / raw)
  To: David Teigland; +Cc: linux-kernel, Lars Marowsky-Bree

Hi Dave,

On Thursday 08 July 2004 06:53, David Teigland wrote:
> We have a symmetric, kernel-based, stand-alone cluster manager (CMAN)
> that has no ties to anything else whatsoever.  It'll simply run and
> answer the question "who's in the cluster?" by providing a list of
> names/nodeids.

While we're in here, could you please explain why CMAN needs to be 
kernel-based?  (Just thought I'd broach the question before Christoph 
does.)

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08 18:22                     ` Daniel Phillips
@ 2004-07-08 19:41                       ` Steven Dake
  2004-07-10  4:58                         ` David Teigland
  2004-07-10  4:58                         ` Daniel Phillips
  0 siblings, 2 replies; 55+ messages in thread
From: Steven Dake @ 2004-07-08 19:41 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: David Teigland, linux-kernel, Lars Marowsky-Bree

On Thu, 2004-07-08 at 11:22, Daniel Phillips wrote:
> Hi Dave,
> 
> On Thursday 08 July 2004 06:53, David Teigland wrote:
> > We have a symmetric, kernel-based, stand-alone cluster manager (CMAN)
> > that has no ties to anything else whatsoever.  It'll simply run and
> > answer the question "who's in the cluster?" by providing a list of
> > names/nodeids.
> 
> While we're in here, could you please explain why CMAN needs to be 
> kernel-based?  (Just thought I'd broach the question before Christoph 
> does.)
> 
> Regards,
> 
> Daniel

Daniel,

I have that same question as well.  I can think of several
disadvantages:

1) security faults in the protocol can crash the kernel or violate 
    system security
2) secure group communication is difficult to implement in kernel
    - secure group key protocols can be implemented fairly easily in 
       userspace using packages like openssl.  Implementing these
       protocols in kernel will prove to be very complex.
3) live upgrades are much more difficult with kernel components
4) a standard interface (the SA Forum AIS) is not being used, 
    disallowing replaceability of components.  This is a big deal for
    people interested in clustering that dont want to be locked into
    a partciular implementation.
5) dlm, fencing, cluster messaging (including membership) can be done
    in userspace, so why not do it there.
6) cluster services for the kernel and cluster services for applications
    will fork, because SA Forum AIS will be chosen for application 
   level services.
7) faults in the protocols can bring down all of Linux, instead of one 
    cluster service on one node.
8) kernel changes require much longer to get into the field and are 
   much more difficult to distribute.  userspace applications are much
   simpler to unit test, qualify, and release.

The advantages are:
interrupt driven timers
some possible reduction in latency related to the cost of executing a
system call when sending messages (including lock messages)

I would like to share with you the efforts of the industry standards
body Service Availability Forum (www.saforum.org).  The Forum is
intersted in specifying interfaces for improving availability of a
system.  One of the collections of APIs (called the application
interface specification) utilizes redundant software components using
clustering approaches to improve availability.

The AIS specification specifies APIs for cluster membership, application
failover, checkpointing, eventing, messaging, and distributed locks. 
All of these services are designed to work with multiple nodes.

It would be beneficial to everyone to adopt these standard interfaces. 
Alot of thought has gone into them.  They are pretty solid.  And there
are atleast two open source implementations under way (openais and
linux-ha) and more on the horizon.

One of these projects, the openais project which I maintain, implements
3 of these services (and the rest will be done in the timeframes we are
talking about) in user space without any kernel changes required.  It
would be possible with kernel to userland communication for the cluster
applications (GFS, distributed block device, etc) to use this standard
interface and implementation.  Then we could avoid all of the
unnecessary kernel maintenance and potential problems that come along
with it. 

Are you interested in such an approach?

Thanks
-steve


> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08 19:41                       ` Steven Dake
@ 2004-07-10  4:58                         ` David Teigland
  2004-07-10  4:58                         ` Daniel Phillips
  1 sibling, 0 replies; 55+ messages in thread
From: David Teigland @ 2004-07-10  4:58 UTC (permalink / raw)
  To: linux-kernel; +Cc: Steven Dake, Daniel Phillips, Lars Marowsky-Bree


On Thu, Jul 08, 2004 at 12:41:21PM -0700, Steven Dake wrote:
> On Thu, 2004-07-08 at 11:22, Daniel Phillips wrote:
> > Hi Dave,
> > 
> > On Thursday 08 July 2004 06:53, David Teigland wrote:
> > > We have a symmetric, kernel-based, stand-alone cluster manager (CMAN)
> > > that has no ties to anything else whatsoever.  It'll simply run and
> > > answer the question "who's in the cluster?" by providing a list of
> > > names/nodeids.
> > 
> > While we're in here, could you please explain why CMAN needs to be 
> > kernel-based?  (Just thought I'd broach the question before Christoph 
> > does.)
> 
> I have that same question as well.  

gfs needs to run in the kernel.  dlm should run in the kernel since gfs uses it
so heavily.  cman is the clustering subsystem on top of which both of those are
built and on which both depend quite critically.  It simply makes most sense to
put cman in the kernel for what we're doing with it.  That's not a dogmatic
position, just a practical one based on our experience.


> I can think of several disadvantages:
> 
> 1) security faults in the protocol can crash the kernel or violate 
>     system security
> 2) secure group communication is difficult to implement in kernel
>     - secure group key protocols can be implemented fairly easily in 
>        userspace using packages like openssl.  Implementing these
>        protocols in kernel will prove to be very complex.
> 3) live upgrades are much more difficult with kernel components
> 4) a standard interface (the SA Forum AIS) is not being used, 
>     disallowing replaceability of components.  This is a big deal for
>     people interested in clustering that dont want to be locked into
>     a partciular implementation.
> 5) dlm, fencing, cluster messaging (including membership) can be done
>     in userspace, so why not do it there.
> 6) cluster services for the kernel and cluster services for applications
>     will fork, because SA Forum AIS will be chosen for application 
>    level services.
> 7) faults in the protocols can bring down all of Linux, instead of one 
>     cluster service on one node.
> 8) kernel changes require much longer to get into the field and are 
>    much more difficult to distribute.  userspace applications are much
>    simpler to unit test, qualify, and release.
> 
> The advantages are:
> interrupt driven timers
> some possible reduction in latency related to the cost of executing a
> system call when sending messages (including lock messages)

This view of advantages/disadvantages seems sensible when working with your
average userland clustering application.  The SAF spec looks pretty nice in
that context.  I think gfs and a kernel-based dlm for gfs are a different
story, though.  They're different enough from other things that few of the same
considerations seem practical.  This has been our experience so far, things
could possibly change for some next-generation (think time span of years).

You'll note that gfs uses external, interchangable locking/cluster systems
which makes it easy to look at alternatives.  cman and dlm are what gfs/clvm
use today; if they prove useful to others that's great, we'd even be happy to
help make them more useful.

-- 
Dave Teigland  <teigland@redhat.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08 19:41                       ` Steven Dake
  2004-07-10  4:58                         ` David Teigland
@ 2004-07-10  4:58                         ` Daniel Phillips
  2004-07-10 17:59                           ` Steven Dake
  1 sibling, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-10  4:58 UTC (permalink / raw)
  To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

Hi Steven,

On Thursday 08 July 2004 15:41, Steven Dake wrote:
> On Thu, 2004-07-08 at 11:22, Daniel Phillips wrote:
> > While we're in here, could you please explain why CMAN needs to be
> > kernel-based?  (Just thought I'd broach the question before Christoph
> > does.)
>
> Daniel,
>
> I have that same question as well.  I can think of several
> disadvantages:
>
> 1) security faults in the protocol can crash the kernel or violate
>     system security
> 2) secure group communication is difficult to implement in kernel
>     - secure group key protocols can be implemented fairly easily in
>        userspace using packages like openssl.  Implementing these
>        protocols in kernel will prove to be very complex.
> 3) live upgrades are much more difficult with kernel components
> 4) a standard interface (the SA Forum AIS) is not being used,
>     disallowing replaceability of components.  This is a big deal for
>     people interested in clustering that dont want to be locked into
>     a partciular implementation.
> 5) dlm, fencing, cluster messaging (including membership) can be done
>     in userspace, so why not do it there.
> 6) cluster services for the kernel and cluster services for applications
>     will fork, because SA Forum AIS will be chosen for application
>    level services.
> 7) faults in the protocols can bring down all of Linux, instead of one
>     cluster service on one node.
> 8) kernel changes require much longer to get into the field and are
>    much more difficult to distribute.  userspace applications are much
>    simpler to unit test, qualify, and release.
>
> The advantages are:
> interrupt driven timers
> some possible reduction in latency related to the cost of executing a
> system call when sending messages (including lock messages)

I'm not saying you're wrong, but I can think of an advantage you didn't 
mention: a service living in kernel will inherit the PF_MEMALLOC state of the 
process that called it, that is, a VM cache flushing task.  A userspace 
service will not.  A cluster block device in kernel may need to invoke some 
service in userspace at an inconvenient time.

For example, suppose somebody spills coffee into a network node while another 
network node is in PF_MEMALLOC state, busily trying to write out dirty file 
data to it.  The kernel block device now needs to yell to the user space 
service to go get it a new network connection.  But the userspace service may 
need to allocate some memory to do that, and, whoops, the kernel won't give 
it any because it is in PF_MEMALLOC state.  Now what?

> One of these projects, the openais project which I maintain, implements
> 3 of these services (and the rest will be done in the timeframes we are
> talking about) in user space without any kernel changes required.  It
> would be possible with kernel to userland communication for the cluster
> applications (GFS, distributed block device, etc) to use this standard
> interface and implementation.  Then we could avoid all of the
> unnecessary kernel maintenance and potential problems that come along
> with it.
>
> Are you interested in such an approach?

We'd be remiss not to be aware of it, and its advantages.  It seems your 
project is still in early stages.  How about we take pains to ensure that 
your cluster membership service is plugable into the CMAN infrastructure, as 
a starting point.

Though I admit I haven't read through the whole code tree, there doesn't seem 
to be a distributed lock manager there.  Maybe that is because it's so 
tightly coded I missed it?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
@ 2004-07-10 14:58 James Bottomley
  2004-07-10 16:04 ` David Teigland
  0 siblings, 1 reply; 55+ messages in thread
From: James Bottomley @ 2004-07-10 14:58 UTC (permalink / raw)
  To: David Teigland; +Cc: Linux Kernel

    gfs needs to run in the kernel.  dlm should run in the kernel since gfs uses it
    so heavily.  cman is the clustering subsystem on top of which both of those are
    built and on which both depend quite critically.  It simply makes most sense to
    put cman in the kernel for what we're doing with it.  That's not a dogmatic
    position, just a practical one based on our experience.
    
    
This isn't really acceptable.  We've spent a long time throwing things
out of the kernel so you really need a good justification for putting
things in again.  "it makes sense" and "its just practical" aren't
sufficient.

You also face two other additional hurdles:

1) GFS today uses a user space DLM.  What critical problems does this
have that you suddenly need to move it all into the kernel?

2) We have numerous other clustering products for Linux, none of which
(well except the Veritas one) has any requirement at all on having
pieces in the kernel.  If all the others operate in user space, why does
yours need to be in the kernel?

So do you have a justification for requiring these as kernel components?

James




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-10 14:58 James Bottomley
@ 2004-07-10 16:04 ` David Teigland
  2004-07-10 16:26   ` James Bottomley
  0 siblings, 1 reply; 55+ messages in thread
From: David Teigland @ 2004-07-10 16:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: James Bottomley

On Sat, Jul 10, 2004 at 09:58:02AM -0500, James Bottomley wrote:
>     gfs needs to run in the kernel.  dlm should run in the kernel since gfs
>     uses it so heavily.  cman is the clustering subsystem on top of which
>     both of those are built and on which both depend quite critically.  It
>     simply makes most sense to put cman in the kernel for what we're doing
>     with it.  That's not a dogmatic position, just a practical one based on
>     our experience.
>     
> This isn't really acceptable.  We've spent a long time throwing things out of
> the kernel so you really need a good justification for putting things in
> again.  "it makes sense" and "its just practical" aren't sufficient.

The "it" refers to gfs.  This means gfs doesn't make a lot of sense and isn't
very practical without it.  I'm not the one to speculate on what gfs would
become otherwise, others would do that better.


> You also face two other additional hurdles:
> 
> 1) GFS today uses a user space DLM.  What critical problems does this have
> that you suddenly need to move it all into the kernel?

GFS does not use a user space dlm today.  GFS uses the client-server gulm lock
manager for which the client (gfs) side runs in the kernel and the gulm server
runs in userspace on some other node.  People have naturally been averse to
using servers like this with gfs for a long time and we've finally created the
serverless dlm (a la VMS clusters).  For many people this is the only option
that makes gfs interesting; it's also what the opengfs group was doing.

This is a revealing discussion.  We've worked hard to make gfs's lock manager
independent from gfs itself so it could be useful to others and make gfs less
monolithic.  We could have left it embedded within the file system itself --
that's what most other cluster file systems do.  If we'd done that we would
have avoided this objection altogether but with an inferior design.  The fact
that there's an independent lock manager to point at and question illustrates
our success.  The same goes for the cluster manager.  (We could, of course, do
some simple glueing together and make a monlithic system again :-)


> 2) We have numerous other clustering products for Linux, none of which (well
> except the Veritas one) has any requirement at all on having pieces in the
> kernel.  If all the others operate in user space, why does yours need to be
> in the kernel?

If you want gfs in user space you don't want gfs; you want something different.

-- 
Dave Teigland  <teigland@redhat.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-10 16:04 ` David Teigland
@ 2004-07-10 16:26   ` James Bottomley
  0 siblings, 0 replies; 55+ messages in thread
From: James Bottomley @ 2004-07-10 16:26 UTC (permalink / raw)
  To: David Teigland; +Cc: Linux Kernel

On Sat, 2004-07-10 at 11:04, David Teigland wrote:
> The "it" refers to gfs.  This means gfs doesn't make a lot of sense and isn't
> very practical without it.  I'm not the one to speculate on what gfs would
> become otherwise, others would do that better.

This is what you actually said:

> It simply makes most sense to put cman in the kernel for
> what we're doing with it.

I interpret that to mean you think cman (your cluster manager) should be
in the kernel.  Is this incorrect?

> 
> > You also face two other additional hurdles:
> > 
> > 1) GFS today uses a user space DLM.  What critical problems does this have
> > that you suddenly need to move it all into the kernel?
> 
> GFS does not use a user space dlm today.  GFS uses the client-server gulm lock
> manager for which the client (gfs) side runs in the kernel and the gulm server
> runs in userspace on some other node.  People have naturally been averse to
> using servers like this with gfs for a long time and we've finally created the
> serverless dlm (a la VMS clusters).  For many people this is the only option
> that makes gfs interesting; it's also what the opengfs group was doing.

OK, whatever you choose to call it, the previous lock manager used by
gfs was userspace.

OK, so why is a kernel based DLM the only option that makes GFS
interesting?  What are the concrete advantages you achieve with a kernel
based DLM that you don't get with a user space one?  There are plenty of
symmetric serverless userspace DLM implementations that follow the old
VMS (and even updated by Oracle) spec.

Steve Dake has already given a pretty compelling list of why you
shouldn't put the DLM and clustering in the kernel, what is the more
compelling list of reasons why it should be?

> This is a revealing discussion.  We've worked hard to make gfs's lock manager
> independent from gfs itself so it could be useful to others and make gfs less
> monolithic.  We could have left it embedded within the file system itself --
> that's what most other cluster file systems do.  If we'd done that we would
> have avoided this objection altogether but with an inferior design.  The fact
> that there's an independent lock manager to point at and question illustrates
> our success.  The same goes for the cluster manager.  (We could, of course, do
> some simple glueing together and make a monlithic system again :-)

I'm not questioning your goal, merely your in-kernel implementation.
Sharing is good.  However things which are shared don't automatically
have to be in-kernel.

> > 2) We have numerous other clustering products for Linux, none of which (well
> > except the Veritas one) has any requirement at all on having pieces in the
> > kernel.  If all the others operate in user space, why does yours need to be
> > in the kernel?
> 
> If you want gfs in user space you don't want gfs; you want something different.

I didn't say GFS, I said "cluster products".  That's the DLM and CMAN
pieces of your architecture.

Once you can convince us that CMAN et al should be in the kernel, the
next stage of the discussion would be the API.  Several groups (like
GGL, SAF and OCF) have done API work for clusters.  They were mostly
careful to select APIs that avoided mandating cluster policy.  You seem
to have chosen a particular policy (voting quorate) to implement. 
Again, that's a red flag.  Policy should not be in the kernel;  if we
all agree there should be in-kernel APIs for clustering then they should
be sufficiently abstracted to support all current cluster policies.

James



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-10  4:58                         ` Daniel Phillips
@ 2004-07-10 17:59                           ` Steven Dake
  2004-07-10 20:57                             ` Daniel Phillips
  0 siblings, 1 reply; 55+ messages in thread
From: Steven Dake @ 2004-07-10 17:59 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

Comments inline thanks
-steve

On Fri, 2004-07-09 at 21:58, Daniel Phillips wrote:
> Hi Steven,
> 
> On Thursday 08 July 2004 15:41, Steven Dake wrote:
> > On Thu, 2004-07-08 at 11:22, Daniel Phillips wrote:
> > > While we're in here, could you please explain why CMAN needs to be
> > > kernel-based?  (Just thought I'd broach the question before Christoph
> > > does.)
> >
> > Daniel,
> >
> > I have that same question as well.  I can think of several
> > disadvantages:
> >
> > 1) security faults in the protocol can crash the kernel or violate
> >     system security
> > 2) secure group communication is difficult to implement in kernel
> >     - secure group key protocols can be implemented fairly easily in
> >        userspace using packages like openssl.  Implementing these
> >        protocols in kernel will prove to be very complex.
> > 3) live upgrades are much more difficult with kernel components
> > 4) a standard interface (the SA Forum AIS) is not being used,
> >     disallowing replaceability of components.  This is a big deal for
> >     people interested in clustering that dont want to be locked into
> >     a partciular implementation.
> > 5) dlm, fencing, cluster messaging (including membership) can be done
> >     in userspace, so why not do it there.
> > 6) cluster services for the kernel and cluster services for applications
> >     will fork, because SA Forum AIS will be chosen for application
> >    level services.
> > 7) faults in the protocols can bring down all of Linux, instead of one
> >     cluster service on one node.
> > 8) kernel changes require much longer to get into the field and are
> >    much more difficult to distribute.  userspace applications are much
> >    simpler to unit test, qualify, and release.
> >
> > The advantages are:
> > interrupt driven timers
> > some possible reduction in latency related to the cost of executing a
> > system call when sending messages (including lock messages)
> 
> I'm not saying you're wrong, but I can think of an advantage you didn't 
> mention: a service living in kernel will inherit the PF_MEMALLOC state of the 
> process that called it, that is, a VM cache flushing task.  A userspace 
> service will not.  A cluster block device in kernel may need to invoke some 
> service in userspace at an inconvenient time.
> 
> For example, suppose somebody spills coffee into a network node while another 
> network node is in PF_MEMALLOC state, busily trying to write out dirty file 
> data to it.  The kernel block device now needs to yell to the user space 
> service to go get it a new network connection.  But the userspace service may 
> need to allocate some memory to do that, and, whoops, the kernel won't give 
> it any because it is in PF_MEMALLOC state.  Now what?
> 

overload conditions that have caused the kernel to run low on memory are
a difficult problem, even for kernel components.  Currently openais
includes "memory pools" which preallocate data structures.  While that
work is not yet complete, the intent is to ensure every data area is
preallocated so the openais executive (the thing that does all of the
work) doesn't ever request extra memory once it becomes operational.

This of course, leads to problems in the following system calls which
openais uses extensively:
sys_poll
sys_recvmsg
sys_sendmsg

which require the allocations of memory with GFP_KERNEL, which can then
fail returning ENOMEM to userland.  The openais protocol currently can
handle low memory failures in recvmsg and sendmsg.  This is because it
uses a protocol designed to operate on lossy networks.

The poll system call problem will be rectified by utilizing
sys_epoll_wait which does not allocate any memory (the poll data is
preallocated).

I hope that helps atleast answer that some r&d is underway to solve this
particular overload problem in userspace.

> > One of these projects, the openais project which I maintain, implements
> > 3 of these services (and the rest will be done in the timeframes we are
> > talking about) in user space without any kernel changes required.  It
> > would be possible with kernel to userland communication for the cluster
> > applications (GFS, distributed block device, etc) to use this standard
> > interface and implementation.  Then we could avoid all of the
> > unnecessary kernel maintenance and potential problems that come along
> > with it.
> >
> > Are you interested in such an approach?
> 
> We'd be remiss not to be aware of it, and its advantages.  It seems your 
> project is still in early stages.  How about we take pains to ensure that 
> your cluster membership service is plugable into the CMAN infrastructure, as 
> a starting point.
> 
sounds good

> Though I admit I haven't read through the whole code tree, there doesn't seem 
> to be a distributed lock manager there.  Maybe that is because it's so 
> tightly coded I missed it?
> 

There is as of yet no implementation of the SAF AIS dlock API in
openais.  The work requires about 4 weeks of development for someone
well-skilled.  I'd expect a contribution for this API in the timeframes
that make GFS interesting.

I'd invite you, or others interested in these sorts of services, to
contribute that code, if interested.  If interested in developing such a
service for openais, check out the developer's map (which describes
developing a service for openais) at:

http://developer.osdl.org/dev/openais/src/README.devmap

Thanks!
-steve

> Regards,
> 
> Daniel


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-10 17:59                           ` Steven Dake
@ 2004-07-10 20:57                             ` Daniel Phillips
  2004-07-10 23:24                               ` Steven Dake
  0 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-10 20:57 UTC (permalink / raw)
  To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

On Saturday 10 July 2004 13:59, Steven Dake wrote:
> > I'm not saying you're wrong, but I can think of an advantage you
> > didn't mention: a service living in kernel will inherit the
> > PF_MEMALLOC state of the process that called it, that is, a VM
> > cache flushing task.  A userspace service will not.  A cluster
> > block device in kernel may need to invoke some service in userspace
> > at an inconvenient time.
> >
> > For example, suppose somebody spills coffee into a network node
> > while another network node is in PF_MEMALLOC state, busily trying
> > to write out dirty file data to it.  The kernel block device now
> > needs to yell to the user space service to go get it a new network
> > connection.  But the userspace service may need to allocate some
> > memory to do that, and, whoops, the kernel won't give it any
> > because it is in PF_MEMALLOC state.  Now what?
>
> overload conditions that have caused the kernel to run low on memory
> are a difficult problem, even for kernel components.  Currently
> openais includes "memory pools" which preallocate data structures. 
> While that work is not yet complete, the intent is to ensure every
> data area is preallocated so the openais executive (the thing that
> does all of the work) doesn't ever request extra memory once it
> becomes operational.
>
> This of course, leads to problems in the following system calls which
> openais uses extensively:
> sys_poll
> sys_recvmsg
> sys_sendmsg
>
> which require the allocations of memory with GFP_KERNEL, which can
> then fail returning ENOMEM to userland.  The openais protocol
> currently can handle low memory failures in recvmsg and sendmsg. 
> This is because it uses a protocol designed to operate on lossy
> networks.
>
> The poll system call problem will be rectified by utilizing
> sys_epoll_wait which does not allocate any memory (the poll data is
> preallocated).

But if the user space service is sitting in the kernel's dirty memory 
writeout path, you have a real problem: the low memory condition may 
never get resolved, rendering your userspace service autistic.  
Meanwhile, whoever is generating the dirty memory just keeps spinning 
and spinning, generating more of it, ensuring that if the system does 
survive the first incident, there's another, worse traffic jam coming 
down the pipe.  To trigger this deadlock, a kernel filesystem or block 
device module just has to lose its cluster connection(s) at the wrong 
time.

> I hope that helps atleast answer that some r&d is underway to solve
> this particular overload problem in userspace.

I'm certain there's a solution, but until it is demonstrated and proved, 
any userspace cluster services must be regarded with narrow squinty 
eyes.

> > Though I admit I haven't read through the whole code tree, there
> > doesn't seem to be a distributed lock manager there.  Maybe that is
> > because it's so tightly coded I missed it?
>
> There is as of yet no implementation of the SAF AIS dlock API in
> openais.  The work requires about 4 weeks of development for someone
> well-skilled.  I'd expect a contribution for this API in the
> timeframes that make GFS interesting.

I suspect you have underestimated the amount of development time 
required.

> I'd invite you, or others interested in these sorts of services, to
> contribute that code, if interested.

Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see 
if you can hack it to do what you want.  Just write a kernel module 
that exports the DLM interface to userspace in the desired form.

   http://sources.redhat.com/cluster/dlm/

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-10 20:57                             ` Daniel Phillips
@ 2004-07-10 23:24                               ` Steven Dake
  2004-07-11 19:44                                 ` Daniel Phillips
  0 siblings, 1 reply; 55+ messages in thread
From: Steven Dake @ 2004-07-10 23:24 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

some comments inline


On Sat, 2004-07-10 at 13:57, Daniel Phillips wrote:
> On Saturday 10 July 2004 13:59, Steven Dake wrote:
> > > I'm not saying you're wrong, but I can think of an advantage you
> > > didn't mention: a service living in kernel will inherit the
> > > PF_MEMALLOC state of the process that called it, that is, a VM
> > > cache flushing task.  A userspace service will not.  A cluster
> > > block device in kernel may need to invoke some service in userspace
> > > at an inconvenient time.
> > >
> > > For example, suppose somebody spills coffee into a network node
> > > while another network node is in PF_MEMALLOC state, busily trying
> > > to write out dirty file data to it.  The kernel block device now
> > > needs to yell to the user space service to go get it a new network
> > > connection.  But the userspace service may need to allocate some
> > > memory to do that, and, whoops, the kernel won't give it any
> > > because it is in PF_MEMALLOC state.  Now what?
> >
> > overload conditions that have caused the kernel to run low on memory
> > are a difficult problem, even for kernel components.  Currently
> > openais includes "memory pools" which preallocate data structures. 
> > While that work is not yet complete, the intent is to ensure every
> > data area is preallocated so the openais executive (the thing that
> > does all of the work) doesn't ever request extra memory once it
> > becomes operational.
> >
> > This of course, leads to problems in the following system calls which
> > openais uses extensively:
> > sys_poll
> > sys_recvmsg
> > sys_sendmsg
> >
> > which require the allocations of memory with GFP_KERNEL, which can
> > then fail returning ENOMEM to userland.  The openais protocol
> > currently can handle low memory failures in recvmsg and sendmsg. 
> > This is because it uses a protocol designed to operate on lossy
> > networks.
> >
> > The poll system call problem will be rectified by utilizing
> > sys_epoll_wait which does not allocate any memory (the poll data is
> > preallocated).
> 
> But if the user space service is sitting in the kernel's dirty memory 
> writeout path, you have a real problem: the low memory condition may 
> never get resolved, rendering your userspace service autistic.  
> Meanwhile, whoever is generating the dirty memory just keeps spinning 
> and spinning, generating more of it, ensuring that if the system does 
> survive the first incident, there's another, worse traffic jam coming 
> down the pipe.  To trigger this deadlock, a kernel filesystem or block 
> device module just has to lose its cluster connection(s) at the wrong 
> time.
> 
> > I hope that helps atleast answer that some r&d is underway to solve
> > this particular overload problem in userspace.
> 
> I'm certain there's a solution, but until it is demonstrated and proved, 
> any userspace cluster services must be regarded with narrow squinty 
> eyes.
> 

I agree that a solution must be demonstrated and proved.

There is  another option, which I regularly recommend to anyone that
must deal with memory overload conditions.  Don't size the applications
in such a way as to ever cause memory overload.  This practical approach
requires just a little more thought on application deployment with the
benefit of avoiding the various and many problems with memory overload
that leads to application faults, OS faults, and other sorts of nasty
conditions.

> > > Though I admit I haven't read through the whole code tree, there
> > > doesn't seem to be a distributed lock manager there.  Maybe that is
> > > because it's so tightly coded I missed it?
> >
> > There is as of yet no implementation of the SAF AIS dlock API in
> > openais.  The work requires about 4 weeks of development for someone
> > well-skilled.  I'd expect a contribution for this API in the
> > timeframes that make GFS interesting.
> 
> I suspect you have underestimated the amount of development time 
> required.
> 

The checkpointing api took approx 3 weeks to develop and has many more
functions to implement.  Cluster membership took approx 1 week to
develop.  The AMF which provides application failover, the most
complicated of the APIs, took approx 8 weeks to develop.  The group
messaging protocol (which implements the virtual synchrony model) has
consumed 80% of the development time thus far.

So 4 weeks is reasonable for someone not familiar with the openais
architecture or SA Forum specification, since the virtual synchrony
group messaging protocol is complete enough to implement a lock service
with simple messaging without any race conditions even during network
partitions and merges.

> > I'd invite you, or others interested in these sorts of services, to
> > contribute that code, if interested.
> 
> Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see 
> if you can hack it to do what you want.  Just write a kernel module 
> that exports the DLM interface to userspace in the desired form.
> 
>    http://sources.redhat.com/cluster/dlm/
> 

I would rather avoid non-mainline kernel dependencies at this time as it
makes adoption difficult until kernel patches are merged into upstream
code.  Who wants to patch their kernel to try out some APIs?  I am
doubtful these sort of kernel patches will be merged without a strong
argument of why it absolutely must be implemented in the kernel vs all
of the counter arguments against a kernel implementation.  

There is one more advantage to group messaging and distributed locking
implemented within the kernel, that I hadn't originally considered; it
sure is sexy.

Regards
-steve

> Regards,
> 
> Daniel


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-10 23:24                               ` Steven Dake
@ 2004-07-11 19:44                                 ` Daniel Phillips
  2004-07-11 21:06                                   ` Lars Marowsky-Bree
  2004-07-12  4:08                                   ` Steven Dake
  0 siblings, 2 replies; 55+ messages in thread
From: Daniel Phillips @ 2004-07-11 19:44 UTC (permalink / raw)
  To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

On Saturday 10 July 2004 19:24, Steven Dake wrote:
> On Sat, 2004-07-10 at 13:57, Daniel Phillips wrote:
> > On Saturday 10 July 2004 13:59, Steven Dake wrote:
> > > overload conditions that have caused the kernel to run low on memory
> > > are a difficult problem, even for kernel components...
> > > ...I hope that helps atleast answer that some r&d is underway to solve
> > > this particular overload problem in userspace.
> >
> > I'm certain there's a solution, but until it is demonstrated and proved,
> > any userspace cluster services must be regarded with narrow squinty
> > eyes.
>
> I agree that a solution must be demonstrated and proved.
>
> There is  another option, which I regularly recommend to anyone that
> must deal with memory overload conditions.  Don't size the applications
> in such a way as to ever cause memory overload.

That, and "just add more memory" are the two common mistakes people make when 
thinking about this problem.  The kernel _normally_ runs near the low-memory 
barrier, on the theory that caching as much as possible is a good thing.

Unless you can prove that your userspace approach never deadlocks, the other 
questions don't even move the needle.  I am sure that one day somebody, maybe 
you, will demonstrate a userspace approach that is provably correct.  Until 
then, if you want your cluster to stay up and fail over properly, there's 
only one game in town.  

We need to worry about ensuring that no API _depends_ on the cluster manager 
being in-kernel, and we also need to seek out and excise any parts that could 
possibly be moved out to user space without enabling the deadlock or grossly 
messing up the kernel code.

> > > I'd invite you, or others interested in these sorts of services, to
> > > contribute that code, if interested.
> >
> > Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see
> > if you can hack it to do what you want.  Just write a kernel module
> > that exports the DLM interface to userspace in the desired form.
> >
> >    http://sources.redhat.com/cluster/dlm/
>
> I would rather avoid non-mainline kernel dependencies at this time as it
> makes adoption difficult until kernel patches are merged into upstream
> code.  Who wants to patch their kernel to try out some APIs?

Everybody working on clusters.  It's a fact of life that you have to apply 
patches to run cluster filesystems right now.  Production will be a different 
story, but (except for the stable GFS code on 2.4) nobody is close to that.

> I am doubtful these sort of kernel patches will be merged without a strong
> argument of why it absolutely must be implemented in the kernel vs all
> of the counter arguments against a kernel implementation.

True.  Do you agree that the PF_MEMALLOC argument is a strong one?

> There is one more advantage to group messaging and distributed locking
> implemented within the kernel, that I hadn't originally considered; it
> sure is sexy.

I don't think it's sexy, I think it's ugly, to tell the truth.  I am actively 
researching how to move the slow-path cluster infrastructure out of kernel, 
and I would be pleased to work together with anyone else who is interested in 
this nasty problem.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-11 19:44                                 ` Daniel Phillips
@ 2004-07-11 21:06                                   ` Lars Marowsky-Bree
  2004-07-12  6:58                                     ` Arjan van de Ven
  2004-07-12  4:08                                   ` Steven Dake
  1 sibling, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-11 21:06 UTC (permalink / raw)
  To: Daniel Phillips, sdake; +Cc: David Teigland, linux-kernel

On 2004-07-11T15:44:25,
   Daniel Phillips <phillips@istop.com> said:

> Unless you can prove that your userspace approach never deadlocks, the other 
> questions don't even move the needle.  I am sure that one day somebody, maybe 
> you, will demonstrate a userspace approach that is provably correct.  

If you can _prove_ your kernel-space implementation to be correct, I'll
drop all and every single complaint ;)

> Until then, if you want your cluster to stay up and fail over
> properly, there's only one game in town.  

This however is not true; clusters have managed just fine running in
user-space (realtime priority, mlocked into (pre-allocated) memory
etc).

I agree that for a cluster filesystem it's much lower latency to have
the infrastructure in the kernel. Going back and forth to user-land just
ain't as fast and also not very neat.

However, the memory argument is pretty weak; the memory for
heartbeating and core functionality must be pre-allocated if you care
that much. And if you cannot allocate it, maybe you ain't healthy enough
to join the cluster in the first place.

Otherwise, I don't much care about whether it's in-kernel or not.

My main argument against being in the kernel space has always been
portability and ease of integration, which makes this quite annoying for
ISVs, and the support issues which arise. But if it's however a common
component part of the 'kernel proper', then this argument no longer
holds.

If the infrastructure takes that jump, I'd be happy. Infrastructure is
boring and has been solved/reinvented so often there's hardly anything
new and exciting about heartbeating, membership, there's more fun work
higher up the stack.

> > There is one more advantage to group messaging and distributed
> > locking implemented within the kernel, that I hadn't originally
> > considered; it sure is sexy.
> I don't think it's sexy, I think it's ugly, to tell the truth.  I am
> actively researching how to move the slow-path cluster infrastructure
> out of kernel, and I would be pleased to work together with anyone
> else who is interested in this nasty problem.

Messaging (which hopefully includes strong authentication if not
encryption, though I could see that being delegated to IPsec) and
locking is in the fast-path, though.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-11 19:44                                 ` Daniel Phillips
  2004-07-11 21:06                                   ` Lars Marowsky-Bree
@ 2004-07-12  4:08                                   ` Steven Dake
  2004-07-12  4:23                                     ` Daniel Phillips
  1 sibling, 1 reply; 55+ messages in thread
From: Steven Dake @ 2004-07-12  4:08 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

On Sun, 2004-07-11 at 12:44, Daniel Phillips wrote:
> On Saturday 10 July 2004 19:24, Steven Dake wrote:
> > On Sat, 2004-07-10 at 13:57, Daniel Phillips wrote:
> > > On Saturday 10 July 2004 13:59, Steven Dake wrote:
> > > > overload conditions that have caused the kernel to run low on memory
> > > > are a difficult problem, even for kernel components...
> > > > ...I hope that helps atleast answer that some r&d is underway to solve
> > > > this particular overload problem in userspace.
> > >
> > > I'm certain there's a solution, but until it is demonstrated and proved,
> > > any userspace cluster services must be regarded with narrow squinty
> > > eyes.
> >
> > I agree that a solution must be demonstrated and proved.
> >
> > There is  another option, which I regularly recommend to anyone that
> > must deal with memory overload conditions.  Don't size the applications
> > in such a way as to ever cause memory overload.
> 
> That, and "just add more memory" are the two common mistakes people make when 
> thinking about this problem.  The kernel _normally_ runs near the low-memory 
> barrier, on the theory that caching as much as possible is a good thing.
> 

Running "near low memory conditions" and running in memory overload
which triggers the OOM killer and other bad behaviors are two totally
different conditions in the kernel.

> Unless you can prove that your userspace approach never deadlocks, the other 
> questions don't even move the needle.  I am sure that one day somebody, maybe 
> you, will demonstrate a userspace approach that is provably correct.  Until 
> then, if you want your cluster to stay up and fail over properly, there's 
> only one game in town.  
> 

As soon as you have proved that cman's cluster protocol cannot be the
target of attacks which lead to kernel faults or security faults..

Byzantine failures are a fact of life.  There are protocols to minimize
these sorts of attacks, but implementing them in the kernel is going to
prove very difficult (but possible).  One approach is to get them
working in userspace correctly, and port them to the kernel.

Oom conditions are another fact of life for poorly sized systems.  If a
cluster is within an OOM condition, it should be removed from the
cluster (because it is in overload, under which unknown and generally
bad behaviors occur).

The openais project does just this: If everything goes to hell in a
handbasket on the node running the cluster executive, it will be
rejected from the membership.  This rejection is implemented with a
distributed state machine that ensures, even in low memory conditions,
every node (including the failed node) reaches the same conclusions
about the current membership and works today in the current code.  If at
a later time the processor can reenter the membership because it has
freed up some memory, it will do so correctly.

> We need to worry about ensuring that no API _depends_ on the cluster manager 
> being in-kernel, and we also need to seek out and excise any parts that could 
> possibly be moved out to user space without enabling the deadlock or grossly 
> messing up the kernel code.

> > > > I'd invite you, or others interested in these sorts of services, to
> > > > contribute that code, if interested.
> > >
> > > Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see
> > > if you can hack it to do what you want.  Just write a kernel module
> > > that exports the DLM interface to userspace in the desired form.
> > >
> > >    http://sources.redhat.com/cluster/dlm/
> >
> > I would rather avoid non-mainline kernel dependencies at this time as it
> > makes adoption difficult until kernel patches are merged into upstream
> > code.  Who wants to patch their kernel to try out some APIs?
> 
> Everybody working on clusters.  It's a fact of life that you have to apply 
> patches to run cluster filesystems right now.  Production will be a different 
> story, but (except for the stable GFS code on 2.4) nobody is close to that.
> 

Perhaps people skilled in running pre-alpha software would consider
patching a kernel to "give it a run".  I have no doubts about that.

I would posit a guess people interested in implementing production
clusters are not too interested about applying kernel patches (and
causing their kernel to become unsupported) to achieve clustering
support any time soon.

> > I am doubtful these sort of kernel patches will be merged without a strong
> > argument of why it absolutely must be implemented in the kernel vs all
> > of the counter arguments against a kernel implementation.
> 
> True.  Do you agree that the PF_MEMALLOC argument is a strong one?
> 

out of memory overload is a sucky situation poorly handled by any
software, kernel, userland, embedded, whatever.  The best solution is to
size the applications such that a memory overload doesn't occur.  Then
if a memory overload condition does occur, that node should aleast
become suspected of a byzantine failure condition which should cause its
rejection from the current membership (in the case of a distributed
system such as a cluster).

> > There is one more advantage to group messaging and distributed locking
> > implemented within the kernel, that I hadn't originally considered; it
> > sure is sexy.
> 
> I don't think it's sexy, I think it's ugly, to tell the truth.  I am actively 
> researching how to move the slow-path cluster infrastructure out of kernel, 
> and I would be pleased to work together with anyone else who is interested in 
> this nasty problem.
> 
There can be some advantages to group messaging being implemented in the
kernel, if it is secure, done correctly (in my view, correctly means
implementing the virtual synchrony model) and has low risk of impact to
other systems.

There are no kernel implemented clustering protocols that come close to
these goals today.

There are userland implementations under way which will meet these
objectives.

Perhaps these protocols could be ported to the kernel if group messaging
absolutely must be available to kernel components without userland
intervention.  But I'm still not convinced userland isn't the correct
place for these sorts of things.

Thanks
-steve

> Regards,
> 
> Daniel


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12  4:08                                   ` Steven Dake
@ 2004-07-12  4:23                                     ` Daniel Phillips
  2004-07-12 18:21                                       ` Steven Dake
  0 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-12  4:23 UTC (permalink / raw)
  To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

On Monday 12 July 2004 00:08, Steven Dake wrote:
> On Sun, 2004-07-11 at 12:44, Daniel Phillips wrote:
> Oom conditions are another fact of life for poorly sized systems.  If
> a cluster is within an OOM condition, it should be removed from the
> cluster (because it is in overload, under which unknown and generally
> bad behaviors occur).

You missed the point.  The memory deadlock I pointed out occurs in 
_normal operation_.  You have to find a way around it, or kernel 
cluster services win, plain and simple.

> The openais project does just this: If everything goes to hell in a
> handbasket on the node running the cluster executive, it will be
> rejected from the membership.  This rejection is implemented with a
> distributed state machine that ensures, even in low memory
> conditions, every node (including the failed node) reaches the same
> conclusions about the current membership and works today in the
> current code.  If at a later time the processor can reenter the
> membership because it has freed up some memory, it will do so
> correctly.

Think about it.  Do you want nodes spontaneously falling over from time 
to time, even though nothing is wrong with them?  What does that do 
your 5 nines?

> > > I would rather avoid non-mainline kernel dependencies at this
> > > time as it makes adoption difficult until kernel patches are
> > > merged into upstream code.  Who wants to patch their kernel to
> > > try out some APIs?
> >
> > Everybody working on clusters.  It's a fact of life that you have
> > to apply patches to run cluster filesystems right now.  Production
> > will be a different story, but (except for the stable GFS code on
> > 2.4) nobody is close to that.
>
> Perhaps people skilled in running pre-alpha software would consider
> patching a kernel to "give it a run".  I have no doubts about that.
>
> I would posit a guess people interested in implementing production
> clusters are not too interested about applying kernel patches (and
> causing their kernel to become unsupported) to achieve clustering
> support any time soon.

We are _far_ from production, at least on 2.6.  At this point, we are 
only interested in people who like to code, test, tinker, and be the 
first kid on the block with a shiny new storage cluster in their rec 
room.  And by "we" I mean "you, me, and everybody else who hopes that 
Linux will kick butt in clusters, in the 2.8 time frame."

> > > I am doubtful these sort of kernel patches will be merged without
> > > a strong argument of why it absolutely must be implemented in the
> > > kernel vs all of the counter arguments against a kernel
> > > implementation.
> >
> > True.  Do you agree that the PF_MEMALLOC argument is a strong one?
>
> out of memory overload is a sucky situation poorly handled by any
> software, kernel, userland, embedded, whatever.

In case you missed it above, please let me point out one more time that 
I am not talking about OOM.  I'm talking about a deadlock that may come 
up even when a resource usage is well within limits, which is inherent 
in the basic design of Linux.  There is nothing Byzantine about it.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-11 21:06                                   ` Lars Marowsky-Bree
@ 2004-07-12  6:58                                     ` Arjan van de Ven
  2004-07-12 10:05                                       ` Lars Marowsky-Bree
  0 siblings, 1 reply; 55+ messages in thread
From: Arjan van de Ven @ 2004-07-12  6:58 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 793 bytes --]


> 
> This however is not true; clusters have managed just fine running in
> user-space (realtime priority, mlocked into (pre-allocated) memory
> etc).

(ignoring the entire context and argument)

Running realtime and mlocked (prealloced) is most certainly not
sufficient for causes like this; any system call that internally
allocates memory (even if it's just for allocating the kernel side of
the filename you handle to open) can lead to this RT, mlocked process to
cause VM writeout elsewhere. 

While I can't say how this affects your argument, everyone should be
really careful with the "just mlock it" argument because it just doesn't
help the worst case in scenarios like this. (It most obviously helps the
average case so for soft-realtime use it's a good approach)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12  6:58                                     ` Arjan van de Ven
@ 2004-07-12 10:05                                       ` Lars Marowsky-Bree
  2004-07-12 10:11                                         ` Arjan van de Ven
  0 siblings, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-12 10:05 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel

On 2004-07-12T08:58:46,
   Arjan van de Ven <arjanv@redhat.com> said:

> Running realtime and mlocked (prealloced) is most certainly not
> sufficient for causes like this; any system call that internally
> allocates memory (even if it's just for allocating the kernel side of
> the filename you handle to open) can lead to this RT, mlocked process to
> cause VM writeout elsewhere. 

Of course; appropriate safety measures - like not doing any syscall
which could potentially block, or isolating them from the main task via
double-buffering childs - need to be done. (heartbeat does this in
fact.)

Again, if we have "many" in kernel users requiring high performance &
low-latency, running in the kernel may not be as bad, but I still don't
entirely like it.

But user-space can also manage just fine, and instead continuing the "we
need highperf, low-latency and non-blocking so it must be in the
kernel", we may want to consider how to have high-perf low-latency
kernel/user-space communication so that we can NOT move this into the
kernel.

Suffice to say that many user-space implementations exist which satisfy
these needs quite sufficiently; in the case of a CFS, this argument may
be different, but I'd like to see some hard data to back it up.

(On a practical note, a system which drops out of membership because
allocating a 256 byte buffer for a filename takes longer than the node
deadtime (due to high load) is reasonably unlikely to be a healthy
cluster member anyway and is on its road to eviction already.)

The main reason why I'd like to see cluster infrastructure in the kernel
is not technical, but because it increases the pressure on unification
so much that people might actually get their act together this time ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 10:05                                       ` Lars Marowsky-Bree
@ 2004-07-12 10:11                                         ` Arjan van de Ven
  2004-07-12 10:21                                           ` Lars Marowsky-Bree
  0 siblings, 1 reply; 55+ messages in thread
From: Arjan van de Ven @ 2004-07-12 10:11 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1173 bytes --]


On Mon, Jul 12, 2004 at 12:05:47PM +0200, Lars Marowsky-Bree wrote:
> On 2004-07-12T08:58:46,
>    Arjan van de Ven <arjanv@redhat.com> said:
> 
> > Running realtime and mlocked (prealloced) is most certainly not
> > sufficient for causes like this; any system call that internally
> > allocates memory (even if it's just for allocating the kernel side of
> > the filename you handle to open) can lead to this RT, mlocked process to
> > cause VM writeout elsewhere. 
> 
> Of course; appropriate safety measures - like not doing any syscall
> which could potentially block, or isolating them from the main task via
> double-buffering childs - need to be done. (heartbeat does this in
> fact.)

well the problem is that you cannot prevent a syscall from blocking really.
O_NONBLOCK only impacts the waiting for IO/socket buffer space to not do so
(in general), it doesn't impact the memory allocation strategies by
syscalls. And there's a whopping lot of that in the non-boring syscalls...
So while your heartbeat process won't block during getpid, it'll eventually
need to do real work too .... and I'm quite certain that will lead down to
GFP_KERNEL memory allocations.



[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-08 10:53                   ` David Teigland
  2004-07-08 14:14                     ` Chris Friesen
  2004-07-08 18:22                     ` Daniel Phillips
@ 2004-07-12 10:14                     ` Lars Marowsky-Bree
  2 siblings, 0 replies; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-12 10:14 UTC (permalink / raw)
  To: David Teigland, linux-kernel; +Cc: Daniel Phillips

On 2004-07-08T18:53:38,
   David Teigland <teigland@redhat.com> said:

> I'm afraid the fencing issue has been rather misrepresented.  Here's
> what we're doing (a lot of background is necessary I'm afraid.)  We
> have a symmetric, kernel-based, stand-alone cluster manager (CMAN)
> that has no ties to anything else whatsoever.  It'll simply run and
> answer the question "who's in the cluster?" by providing a list of
> names/nodeids.

Excuse my ignorance, but does this ensure that there's concensus among
the nodes about this membership?

> has quorum.  It's a very standard way of doing things -- we modelled it
> directly off the VMS-cluster style.  Whether you care about this quorum value
> or what you do with it are beside the point. 

OK, I agree with this. As long as the CMAN itself doesn't care about
this either but just reports it to the cluster, that's fine.

> What about Fencing?  Fencing is not a part of the cluster manager, not
> a part of the dlm and not a part of gfs.  It's an entirely independent
> system that runs on its own in userland.  It depends on cman for
> cluster information just like the dlm or gfs does.  I'll repeat what I
> said on the linux-cluster mailing list:

I doubt it can be entirely independent; or how do you implement lock
recovery without a fencing mechanism?

> This fencing system is suitable for us in our gfs/clvm work.  It's
> probably suitable for others, too.  For everyone? no. 

It sounds useful enough even for our work, given appropriate
notification of fencing events; instead of scheduling a fencing event,
we'd need to make sure that the node joins a fencing domain and later
block until receiving a notification. It's not as fine grained, but our
approach (based on the dependencies of the resources managed, basically)
might have been more fine grained than required in a typical
environment.

Yes, I can see how that could be made to work.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 10:11                                         ` Arjan van de Ven
@ 2004-07-12 10:21                                           ` Lars Marowsky-Bree
  2004-07-12 10:28                                             ` Arjan van de Ven
  2004-07-14  8:32                                             ` Pavel Machek
  0 siblings, 2 replies; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-12 10:21 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel

On 2004-07-12T12:11:07,
   Arjan van de Ven <arjanv@redhat.com> said:

> well the problem is that you cannot prevent a syscall from blocking really.
> O_NONBLOCK only impacts the waiting for IO/socket buffer space to not do so
> (in general), it doesn't impact the memory allocation strategies by
> syscalls. And there's a whopping lot of that in the non-boring syscalls...
> So while your heartbeat process won't block during getpid, it'll eventually
> need to do real work too .... and I'm quite certain that will lead down to
> GFP_KERNEL memory allocations.

Sure, but the network IO is isolated from the main process via a _very
careful_ non-blocking IO using sockets library, so that works out well.
The only scenario which could still impact this severely would be that
the kernel did not schedule the soft-rr tasks often enough or all NICs
being so overloaded that we can no longer send out the heartbeat
packets, and some more silly conditions. In either case I'd venture that
said node is so unhealthy that it is quite rightfully evicted from the
cluster. A node which is so overloaded should not be starting any new
resources whatsoever.

However, of course this is more difficult for the case where you are in
the write path needed to free some memory; alas, swapping to a GFS mount
is probably a realllllly silly idea, too.

But again, I'd rather like to see this solved (memory pools for
userland, PF_ etc), because it's relevant for many scenarios requiring
near-hard-realtime properties, and the answer surely can't be to push it
all into the kernel.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 10:21                                           ` Lars Marowsky-Bree
@ 2004-07-12 10:28                                             ` Arjan van de Ven
  2004-07-12 11:50                                               ` Lars Marowsky-Bree
  2004-07-14  8:32                                             ` Pavel Machek
  1 sibling, 1 reply; 55+ messages in thread
From: Arjan van de Ven @ 2004-07-12 10:28 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1450 bytes --]

On Mon, Jul 12, 2004 at 12:21:24PM +0200, Lars Marowsky-Bree wrote:
> On 2004-07-12T12:11:07,
>    Arjan van de Ven <arjanv@redhat.com> said:
> 
> > well the problem is that you cannot prevent a syscall from blocking really.
> > O_NONBLOCK only impacts the waiting for IO/socket buffer space to not do so
> > (in general), it doesn't impact the memory allocation strategies by
> > syscalls. And there's a whopping lot of that in the non-boring syscalls...
> > So while your heartbeat process won't block during getpid, it'll eventually
> > need to do real work too .... and I'm quite certain that will lead down to
> > GFP_KERNEL memory allocations.
> 
> Sure, but the network IO is isolated from the main process via a _very
> careful_ non-blocking IO using sockets library, so that works out well.

... which of course never allocates skb's ? ;)

> However, of course this is more difficult for the case where you are in
> the write path needed to free some memory; alas, swapping to a GFS mount
> is probably a realllllly silly idea, too.

there is more than swap, there's dirty pagecache/mmaps as well

> But again, I'd rather like to see this solved (memory pools for
> userland, PF_ etc), because it's relevant for many scenarios requiring

PF_ is not enough really ;) 
You need to force GFP_NOFS etc for several critical parts, and well, by
being in kernel you can avoid a bunch of these allocations for real, and/or
influence their GFP flags

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 10:28                                             ` Arjan van de Ven
@ 2004-07-12 11:50                                               ` Lars Marowsky-Bree
  2004-07-12 12:01                                                 ` Arjan van de Ven
  0 siblings, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-12 11:50 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel

On 2004-07-12T12:28:19,
   Arjan van de Ven <arjanv@redhat.com> said:

> > Sure, but the network IO is isolated from the main process via a _very
> > careful_ non-blocking IO using sockets library, so that works out well.
> ... which of course never allocates skb's ? ;)

No, the interprocess communication does not; it's local sockets. I think
Alan (Robertson) even has a paper on this. It's really quite well
engineered, with a non-blocking poll() implementation based on signals
and stuff. Oh well.

> > But again, I'd rather like to see this solved (memory pools for
> > userland, PF_ etc), because it's relevant for many scenarios requiring
> PF_ is not enough really ;) 
> You need to force GFP_NOFS etc for several critical parts, and well, by
> being in kernel you can avoid a bunch of these allocations for real, and/or
> influence their GFP flags

True enough, but I'm somewhat unhappy with this still. So whenever we
have something like that we need to move it into the kernel space?
(pvmove first, and now the clustering etc.) Can't we come up with a way
to export this flag to user-space?


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 11:50                                               ` Lars Marowsky-Bree
@ 2004-07-12 12:01                                                 ` Arjan van de Ven
  2004-07-12 13:13                                                   ` Lars Marowsky-Bree
  0 siblings, 1 reply; 55+ messages in thread
From: Arjan van de Ven @ 2004-07-12 12:01 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 480 bytes --]

On Mon, Jul 12, 2004 at 01:50:03PM +0200, Lars Marowsky-Bree wrote:
> 
> True enough, but I'm somewhat unhappy with this still. So whenever we
> have something like that we need to move it into the kernel space?
> (pvmove first, and now the clustering etc.) Can't we come up with a way
> to export this flag to user-space?

I'm not convinced that's a good idea, in that it exposes what is basically VM internals 
to userspace, which then would become a set-in-stone interface....

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 12:01                                                 ` Arjan van de Ven
@ 2004-07-12 13:13                                                   ` Lars Marowsky-Bree
  2004-07-12 13:40                                                     ` Nick Piggin
  0 siblings, 1 reply; 55+ messages in thread
From: Lars Marowsky-Bree @ 2004-07-12 13:13 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Daniel Phillips, sdake, David Teigland, linux-kernel

On 2004-07-12T14:01:27,
   Arjan van de Ven <arjanv@redhat.com> said:

> I'm not convinced that's a good idea, in that it exposes what is
> basically VM internals to userspace, which then would become a
> set-in-stone interface....

But I'm also not a big fan of moving all HA relevant infrastructure into
the kernel. Membership and DLM are the first ones; then follows
messaging (and reliable and globally ordered messaging is somewhat
complex - but if one node is slow, it will hurt global communication
too, so...), next someone argues that a node always must be able to
report which resources it holds and fence other nodes even under memory
pressure, and there goes the cluster resource manager and fencing
subsystem into the kernel too etc...

Where's the border? 

And what can we do to make critical user-space infrastructure run
reliably and with deterministic-enough & low latency instead of moving
it all into the kernel?

Yes, the kernel solves these problems right now, but is that really the
path we want to head down? Maybe it is, I'm not sure, afterall we also
have the entire regular network stack in the kernel, but maybe also it
is not.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	    \ ever tried. ever failed. no matter.
SUSE Labs, Research and Development | try again. fail again. fail better.
SUSE LINUX AG - A Novell company    \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 13:13                                                   ` Lars Marowsky-Bree
@ 2004-07-12 13:40                                                     ` Nick Piggin
  2004-07-12 20:54                                                       ` Andrew Morton
  0 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-07-12 13:40 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Arjan van de Ven, Daniel Phillips, sdake, David Teigland,
	linux-kernel

Lars Marowsky-Bree wrote:
> On 2004-07-12T14:01:27,
>    Arjan van de Ven <arjanv@redhat.com> said:
> 
> 
>>I'm not convinced that's a good idea, in that it exposes what is
>>basically VM internals to userspace, which then would become a
>>set-in-stone interface....
> 
> 
> But I'm also not a big fan of moving all HA relevant infrastructure into
> the kernel. Membership and DLM are the first ones; then follows
> messaging (and reliable and globally ordered messaging is somewhat
> complex - but if one node is slow, it will hurt global communication
> too, so...), next someone argues that a node always must be able to
> report which resources it holds and fence other nodes even under memory
> pressure, and there goes the cluster resource manager and fencing
> subsystem into the kernel too etc...
> 
> Where's the border? 
> 
> And what can we do to make critical user-space infrastructure run
> reliably and with deterministic-enough & low latency instead of moving
> it all into the kernel?
> 
> Yes, the kernel solves these problems right now, but is that really the
> path we want to head down? Maybe it is, I'm not sure, afterall we also
> have the entire regular network stack in the kernel, but maybe also it
> is not.
> 

I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

There would probably be a few technical things to work out (like
GFP_NOFS), but I think it would be pretty trivial to implement.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12  4:23                                     ` Daniel Phillips
@ 2004-07-12 18:21                                       ` Steven Dake
  2004-07-12 19:54                                         ` Daniel Phillips
  2004-07-13 20:06                                         ` Pavel Machek
  0 siblings, 2 replies; 55+ messages in thread
From: Steven Dake @ 2004-07-12 18:21 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

On Sun, 2004-07-11 at 21:23, Daniel Phillips wrote:
> On Monday 12 July 2004 00:08, Steven Dake wrote:
> > On Sun, 2004-07-11 at 12:44, Daniel Phillips wrote:
> > Oom conditions are another fact of life for poorly sized systems.  If
> > a cluster is within an OOM condition, it should be removed from the
> > cluster (because it is in overload, under which unknown and generally
> > bad behaviors occur).
> 
> You missed the point.  The memory deadlock I pointed out occurs in 
> _normal operation_.  You have to find a way around it, or kernel 
> cluster services win, plain and simple.
> 

The bottom line is that we just don't know if any such deadlock occurs,
under normal operations.  The remaining objections to in-kernel cluster
services give us alot of reason to test out a userland approach.

I propose after a distributed lock service is implemented in user space,
to add support for such a project into the gfs and remaining redhat
storage cluster services trees.  This will give us real data on
performance and reliability that we can't get by guessing.

Thanks
-steve


> > current code.  If at a later time the processor can reenter the
> > membership because it has freed up some memory, it will do so
> > correctly.
> 
> Think about it.  Do you want nodes spontaneously falling over from time 
> to time, even though nothing is wrong with them?  What does that do 
> your 5 nines?
> 
> > > > I would rather avoid non-mainline kernel dependencies at this
> > > > time as it makes adoption difficult until kernel patches are
> > > > merged into upstream code.  Who wants to patch their kernel to
> > > > try out some APIs?
> > >
> > > Everybody working on clusters.  It's a fact of life that you have
> > > to apply patches to run cluster filesystems right now.  Production
> > > will be a different story, but (except for the stable GFS code on
> > > 2.4) nobody is close to that.
> >
> > Perhaps people skilled in running pre-alpha software would consider
> > patching a kernel to "give it a run".  I have no doubts about that.
> >
> > I would posit a guess people interested in implementing production
> > clusters are not too interested about applying kernel patches (and
> > causing their kernel to become unsupported) to achieve clustering
> > support any time soon.
> 
> We are _far_ from production, at least on 2.6.  At this point, we are 
> only interested in people who like to code, test, tinker, and be the 
> first kid on the block with a shiny new storage cluster in their rec 
> room.  And by "we" I mean "you, me, and everybody else who hopes that 
> Linux will kick butt in clusters, in the 2.8 time frame."
> 
> > > > I am doubtful these sort of kernel patches will be merged without
> > > > a strong argument of why it absolutely must be implemented in the
> > > > kernel vs all of the counter arguments against a kernel
> > > > implementation.
> > >
> > > True.  Do you agree that the PF_MEMALLOC argument is a strong one?
> >
> > out of memory overload is a sucky situation poorly handled by any
> > software, kernel, userland, embedded, whatever.
> 
> In case you missed it above, please let me point out one more time that 
> I am not talking about OOM.  I'm talking about a deadlock that may come 
> up even when a resource usage is well within limits, which is inherent 
> in the basic design of Linux.  There is nothing Byzantine about it.
> 
> Regards,
> 
> Daniel


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 18:21                                       ` Steven Dake
@ 2004-07-12 19:54                                         ` Daniel Phillips
  2004-07-13 20:06                                         ` Pavel Machek
  1 sibling, 0 replies; 55+ messages in thread
From: Daniel Phillips @ 2004-07-12 19:54 UTC (permalink / raw)
  To: sdake; +Cc: Daniel Phillips, David Teigland, linux-kernel, Lars Marowsky-Bree

On Monday 12 July 2004 14:21, Steven Dake wrote:
> On Sun, 2004-07-11 at 21:23, Daniel Phillips wrote:
> > On Monday 12 July 2004 00:08, Steven Dake wrote:
> > > On Sun, 2004-07-11 at 12:44, Daniel Phillips wrote:
> > > Oom conditions are another fact of life for poorly sized systems.
> > > If a cluster is within an OOM condition, it should be removed
> > > from the cluster (because it is in overload, under which unknown
> > > and generally bad behaviors occur).
> >
> > You missed the point.  The memory deadlock I pointed out occurs in
> > _normal operation_.  You have to find a way around it, or kernel
> > cluster services win, plain and simple.
>
> The bottom line is that we just don't know if any such deadlock
> occurs, under normal operations.

I thought I demonstrated that, should I restate?  You need to point out 
the flaw in my argument (about the deadlock, not about philosophy). 
If/when you succeed, I will be pleased.  Until you do succeed, there's 
a deadlock.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 13:40                                                     ` Nick Piggin
@ 2004-07-12 20:54                                                       ` Andrew Morton
  2004-07-13  2:19                                                         ` Daniel Phillips
  2004-07-14 12:19                                                         ` Pavel Machek
  0 siblings, 2 replies; 55+ messages in thread
From: Andrew Morton @ 2004-07-12 20:54 UTC (permalink / raw)
  To: Nick Piggin; +Cc: lmb, arjanv, phillips, sdake, teigland, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> I don't see why it would be a problem to implement a "this task
> facilitates page reclaim" flag for userspace tasks that would take
> care of this as well as the kernel does.

Yes, that has been done before, and it works - userspace "block drivers"
which permanently mark themselves as PF_MEMALLOC to avoid the obvious
deadlocks.

Note that you can achieve a similar thing in current 2.6 by acquiring
realtime scheduling policy, but that's an artifact of some brainwave which
a VM hacker happened to have and isn't a thing which should be relied upon.

A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 20:54                                                       ` Andrew Morton
@ 2004-07-13  2:19                                                         ` Daniel Phillips
  2004-07-13  2:31                                                           ` Nick Piggin
  2004-07-14 12:19                                                         ` Pavel Machek
  1 sibling, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-13  2:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, lmb, arjanv, sdake, teigland, linux-kernel

Hi Andrew,

On Monday 12 July 2004 16:54, Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > I don't see why it would be a problem to implement a "this task
> > facilitates page reclaim" flag for userspace tasks that would take
> > care of this as well as the kernel does.
>
> Yes, that has been done before, and it works - userspace "block drivers"
> which permanently mark themselves as PF_MEMALLOC to avoid the obvious
> deadlocks.
>
> Note that you can achieve a similar thing in current 2.6 by acquiring
> realtime scheduling policy, but that's an artifact of some brainwave which
> a VM hacker happened to have and isn't a thing which should be relied upon.

Do you have a pointer to the brainwave?

> A privileged syscall which allows a task to mark itself as one which
> cleans memory would make sense.

For now we can do it with an ioctl, and we pretty much have to do it for 
pvmove.  But that's when user space drives the kernel by syscalls; there is 
also the nasty (and common) case where the kernel needs userspace to do 
something for it while it's in PF_MEMALLOC.  I'm playing with ideas there, 
but nothing I'm proud of yet.  For now I see the in-kernel approach as the 
conservative one, for anything that could possibly find itself on the VM 
writeout path.

Unfortunately, that may include some messy things like authentication.  I'd 
really like to solve this reliable-userspace problem.  We'd still have lots 
of arguments left to resolve about where things should be, but at least we'd 
have the choice.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-13  2:19                                                         ` Daniel Phillips
@ 2004-07-13  2:31                                                           ` Nick Piggin
  2004-07-27  3:31                                                             ` Daniel Phillips
  0 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-07-13  2:31 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrew Morton, lmb, arjanv, sdake, teigland, linux-kernel

Daniel Phillips wrote:
> Hi Andrew,
> 
> On Monday 12 July 2004 16:54, Andrew Morton wrote:
> 
>>Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>
>>>I don't see why it would be a problem to implement a "this task
>>>facilitates page reclaim" flag for userspace tasks that would take
>>>care of this as well as the kernel does.
>>
>>Yes, that has been done before, and it works - userspace "block drivers"
>>which permanently mark themselves as PF_MEMALLOC to avoid the obvious
>>deadlocks.
>>
>>Note that you can achieve a similar thing in current 2.6 by acquiring
>>realtime scheduling policy, but that's an artifact of some brainwave which
>>a VM hacker happened to have and isn't a thing which should be relied upon.
> 
> 
> Do you have a pointer to the brainwave?
> 

Search for rt_task in mm/page_alloc.c

> 
>>A privileged syscall which allows a task to mark itself as one which
>>cleans memory would make sense.
> 
> 
> For now we can do it with an ioctl, and we pretty much have to do it for 
> pvmove.  But that's when user space drives the kernel by syscalls; there is 
> also the nasty (and common) case where the kernel needs userspace to do 
> something for it while it's in PF_MEMALLOC.  I'm playing with ideas there, 
> but nothing I'm proud of yet.  For now I see the in-kernel approach as the 
> conservative one, for anything that could possibly find itself on the VM 
> writeout path.
> 

You'd obviously want to make the PF_MEMALLOC task as tight as possible,
and running mlocked: I don't particularly see why such a task would be
any safer in-kernel.

PF_MEMALLOC tasks won't enter page reclaim at all. The only way they
will reach the writeout path is if you are write(2)ing stuff (you may
hit synch writeout).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 18:21                                       ` Steven Dake
  2004-07-12 19:54                                         ` Daniel Phillips
@ 2004-07-13 20:06                                         ` Pavel Machek
  1 sibling, 0 replies; 55+ messages in thread
From: Pavel Machek @ 2004-07-13 20:06 UTC (permalink / raw)
  To: Steven Dake
  Cc: Daniel Phillips, Daniel Phillips, David Teigland, linux-kernel,
	Lars Marowsky-Bree

Hi!

> > You missed the point.  The memory deadlock I pointed out occurs in 
> > _normal operation_.  You have to find a way around it, or kernel 
> > cluster services win, plain and simple.
> > 
> 
> The bottom line is that we just don't know if any such deadlock occurs,
> under normal operations.  The remaining objections to in-kernel cluster

I did some work on swapping-over-nbd, which has similar issues,
and yes, the deadlocks were seen under heavy load.

*Designing* something with "lets hope it does not deadlock",
while deadlock clearly can be triggered, looks like bad idea.
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 10:21                                           ` Lars Marowsky-Bree
  2004-07-12 10:28                                             ` Arjan van de Ven
@ 2004-07-14  8:32                                             ` Pavel Machek
  1 sibling, 0 replies; 55+ messages in thread
From: Pavel Machek @ 2004-07-14  8:32 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Arjan van de Ven, Daniel Phillips, sdake, David Teigland,
	linux-kernel

Hi!


> However, of course this is more difficult for the case where you are in
> the write path needed to free some memory; alas, swapping to a GFS mount
> is probably a realllllly silly idea, too.

Swapping to GFS mount is *very* similar. If swapping to GFS can
not work, it is unlikely write support will be reliable.

-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-12 20:54                                                       ` Andrew Morton
  2004-07-13  2:19                                                         ` Daniel Phillips
@ 2004-07-14 12:19                                                         ` Pavel Machek
  2004-07-15  2:19                                                           ` Nick Piggin
  1 sibling, 1 reply; 55+ messages in thread
From: Pavel Machek @ 2004-07-14 12:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, lmb, arjanv, phillips, sdake, teigland, linux-kernel

Hi!

> > I don't see why it would be a problem to implement a "this task
> > facilitates page reclaim" flag for userspace tasks that would take
> > care of this as well as the kernel does.
> 
> Yes, that has been done before, and it works - userspace "block drivers"
> which permanently mark themselves as PF_MEMALLOC to avoid the obvious
> deadlocks.

> Note that you can achieve a similar thing in current 2.6 by acquiring
> realtime scheduling policy, but that's an artifact of some brainwave which
> a VM hacker happened to have and isn't a thing which should be relied upon.
> 
> A privileged syscall which allows a task to mark itself as one which
> cleans memory would make sense.

Does it work?

I mean, in kernel, we have some memory cleaners (say 5), and they
need, say, 1MB total reserved memory.

Now, if you add another task with PF_MEMALLOC. But now you'd need
1.2MB reserved memory, and you only have 1MB. Things are obviously
going to break at some point.
								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-14 12:19                                                         ` Pavel Machek
@ 2004-07-15  2:19                                                           ` Nick Piggin
  2004-07-15 12:03                                                             ` Marcelo Tosatti
  0 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-07-15  2:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, lmb, arjanv, phillips, sdake, teigland,
	linux-kernel

Pavel Machek wrote:
> Hi!
> 
> 
>>>I don't see why it would be a problem to implement a "this task
>>>facilitates page reclaim" flag for userspace tasks that would take
>>>care of this as well as the kernel does.
>>
>>Yes, that has been done before, and it works - userspace "block drivers"
>>which permanently mark themselves as PF_MEMALLOC to avoid the obvious
>>deadlocks.
> 
> 
>>Note that you can achieve a similar thing in current 2.6 by acquiring
>>realtime scheduling policy, but that's an artifact of some brainwave which
>>a VM hacker happened to have and isn't a thing which should be relied upon.
>>
>>A privileged syscall which allows a task to mark itself as one which
>>cleans memory would make sense.
> 
> 
> Does it work?
> 
> I mean, in kernel, we have some memory cleaners (say 5), and they
> need, say, 1MB total reserved memory.
> 
> Now, if you add another task with PF_MEMALLOC. But now you'd need
> 1.2MB reserved memory, and you only have 1MB. Things are obviously
> going to break at some point.
> 								Pavel

Well you'd have to be more careful than that. In particular
you wouldn't just be starting these things up, let alone
have them allocate 1MB in to free some memory.

This situation would still blow up whether you did it in
kernel or not.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-15  2:19                                                           ` Nick Piggin
@ 2004-07-15 12:03                                                             ` Marcelo Tosatti
  0 siblings, 0 replies; 55+ messages in thread
From: Marcelo Tosatti @ 2004-07-15 12:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pavel Machek, Andrew Morton, lmb, arjanv, phillips, sdake,
	teigland, linux-kernel

On Thu, Jul 15, 2004 at 12:19:12PM +1000, Nick Piggin wrote:
> Pavel Machek wrote:
> >Hi!
> >
> >
> >>>I don't see why it would be a problem to implement a "this task
> >>>facilitates page reclaim" flag for userspace tasks that would take
> >>>care of this as well as the kernel does.
> >>
> >>Yes, that has been done before, and it works - userspace "block drivers"
> >>which permanently mark themselves as PF_MEMALLOC to avoid the obvious
> >>deadlocks.

Andrew, as curiosity, what userspace "block driver" sets PF_MEMALLOC for
normal operation?

> >>Note that you can achieve a similar thing in current 2.6 by acquiring
> >>realtime scheduling policy, but that's an artifact of some brainwave which
> >>a VM hacker happened to have and isn't a thing which should be relied 
> >>upon.
> >>
> >>A privileged syscall which allows a task to mark itself as one which
> >>cleans memory would make sense.
> >
> >
> >Does it work?
> >
> >I mean, in kernel, we have some memory cleaners (say 5), and they
> >need, say, 1MB total reserved memory.
> >
> >Now, if you add another task with PF_MEMALLOC. But now you'd need
> >1.2MB reserved memory, and you only have 1MB. Things are obviously
> >going to break at some point.
> >								Pavel
> 
> Well you'd have to be more careful than that. In particular
> you wouldn't just be starting these things up, let alone
> have them allocate 1MB in to free some memory.
> 
> This situation would still blow up whether you did it in
> kernel or not.

Indeed, such PF_MEMALLOC app can probably kill the system if it bugs
allocating lots of memory from the lower reservations. It needs
some limitation. 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-13  2:31                                                           ` Nick Piggin
@ 2004-07-27  3:31                                                             ` Daniel Phillips
  2004-07-27  4:07                                                               ` Nick Piggin
  0 siblings, 1 reply; 55+ messages in thread
From: Daniel Phillips @ 2004-07-27  3:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, lmb, arjanv, sdake, teigland, linux-kernel

On Monday 12 July 2004 22:31, Nick Piggin wrote:
> Daniel Phillips wrote:
> > On Monday 12 July 2004 16:54, Andrew Morton wrote:
> >>Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >>>I don't see why it would be a problem to implement a "this task
> >>>facilitates page reclaim" flag for userspace tasks that would take
> >>>care of this as well as the kernel does.
> >>
> >>Yes, that has been done before, and it works - userspace "block drivers"
> >>which permanently mark themselves as PF_MEMALLOC to avoid the obvious
> >>deadlocks.
> >>
> >>Note that you can achieve a similar thing in current 2.6 by acquiring
> >>realtime scheduling policy, but that's an artifact of some brainwave
> >> which a VM hacker happened to have and isn't a thing which should be
> >> relied upon.
> >
> > Do you have a pointer to the brainwave?
>
> Search for rt_task in mm/page_alloc.c

Ah, interesting idea: realtime tasks get to dip into the PF_MEMALLOC reserve, 
until it gets down to some threshold, then they have to give up and wait like 
any other unwashed nobody of a process.  _But_ if there's a user space 
process sitting in the writeout path and some other realtime process eats the 
entire realtime reserve, everything can still grind to a halt.

So it's interesting for realtime, but does not solve the userspace PF_MEMALLOC 
inversion.

> >>A privileged syscall which allows a task to mark itself as one which
> >>cleans memory would make sense.
> >
> > For now we can do it with an ioctl, and we pretty much have to do it for
> > pvmove.  But that's when user space drives the kernel by syscalls; there
> > is also the nasty (and common) case where the kernel needs userspace to
> > do something for it while it's in PF_MEMALLOC.  I'm playing with ideas
> > there, but nothing I'm proud of yet.  For now I see the in-kernel
> > approach as the conservative one, for anything that could possibly find
> > itself on the VM writeout path.
>
> You'd obviously want to make the PF_MEMALLOC task as tight as possible,
> and running mlocked:

Not just tight, but bounded.  And tight too, of course.

> I don't particularly see why such a task would be any safer in-kernel.

The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or 
similar IPC to user space.

> PF_MEMALLOC tasks won't enter page reclaim at all. The only way they
> will reach the writeout path is if you are write(2)ing stuff (you may
> hit synch writeout).

That's the problem.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-27  3:31                                                             ` Daniel Phillips
@ 2004-07-27  4:07                                                               ` Nick Piggin
  2004-07-27  5:57                                                                 ` Daniel Phillips
  0 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-07-27  4:07 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrew Morton, lmb, arjanv, sdake, teigland, linux-kernel

Daniel Phillips wrote:

>On Monday 12 July 2004 22:31, Nick Piggin wrote:
>
>>
>>Search for rt_task in mm/page_alloc.c
>>
>
>Ah, interesting idea: realtime tasks get to dip into the PF_MEMALLOC reserve, 
>until it gets down to some threshold, then they have to give up and wait like 
>any other unwashed nobody of a process.  _But_ if there's a user space 
>process sitting in the writeout path and some other realtime process eats the 
>entire realtime reserve, everything can still grind to a halt.
>
>So it's interesting for realtime, but does not solve the userspace PF_MEMALLOC 
>inversion.
>
>

Not the rt_task thing, because yes, you can have other RT tasks that aren't
small and bounded that screw up your reserves.

But a PF_MEMALLOC userspace task is still useful.

>>>>A privileged syscall which allows a task to mark itself as one which
>>>>cleans memory would make sense.
>>>>
>>>For now we can do it with an ioctl, and we pretty much have to do it for
>>>pvmove.  But that's when user space drives the kernel by syscalls; there
>>>is also the nasty (and common) case where the kernel needs userspace to
>>>do something for it while it's in PF_MEMALLOC.  I'm playing with ideas
>>>there, but nothing I'm proud of yet.  For now I see the in-kernel
>>>approach as the conservative one, for anything that could possibly find
>>>itself on the VM writeout path.
>>>
>>You'd obviously want to make the PF_MEMALLOC task as tight as possible,
>>and running mlocked:
>>
>
>Not just tight, but bounded.  And tight too, of course.
>
>
>>I don't particularly see why such a task would be any safer in-kernel.
>>
>
>The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or 
>similar IPC to user space.
>
>
This is no different in kernel of course. You would have to think about
which threads need the flag and which do not. Even better, you might
aquire and drop the flag only when required. I can't see any obvious
problems you would run into.

>>PF_MEMALLOC tasks won't enter page reclaim at all. The only way they
>>will reach the writeout path is if you are write(2)ing stuff (you may
>>hit synch writeout).
>>
>
>That's the problem.
>
>

Well I don't think it would be a problem to get the write throttling path
to ignore PF_MEMALLOC tasks if that is what you need. Again, this shouldn't
be any different to in kernel code.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30
  2004-07-27  4:07                                                               ` Nick Piggin
@ 2004-07-27  5:57                                                                 ` Daniel Phillips
  0 siblings, 0 replies; 55+ messages in thread
From: Daniel Phillips @ 2004-07-27  5:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, lmb, arjanv, sdake, teigland, linux-kernel

On Tuesday 27 July 2004 00:07, Nick Piggin wrote:
> But a PF_MEMALLOC userspace task is still useful.

Absolutely.  This is the route I'm taking, and I just use an ioctl to flip the 
task bit as I mentioned (much) earlier.  It still needs to be beaten up in 
practice.  The cluster snapshot block device, which has a relatively complex 
userspace server, should be a nice test case.

> >The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or
> >similar IPC to user space.
>
> This is no different in kernel of course.

I was talking about in-kernel.  Once we let the PF_MEMALLOC state escape to 
user space, things start looking brighter.  But you still have to invoke that 
userspace code somehow, and there is no direct way to do it, hence 
PF_MEMALLOC isn't inherited.  An easy solution is to have a userspace daemon 
that's always in PF_MEMALLOC state, as Andrew mentioned, which we can control 
via a pipe or similar.

> You would have to think about 
> which threads need the flag and which do not. Even better, you might
> aquire and drop the flag only when required.

Yes, that's what the ioctl is about.  However, this doesn't work for servicing 
writeout.

> I can't see any obvious problems you would run into.

;-)

Regards,

Daniel

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2004-07-27  5:56 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-07-05  6:09 [ANNOUNCE] Minneapolis Cluster Summit, July 29-30 Daniel Phillips
2004-07-05 15:09 ` Christoph Hellwig
2004-07-05 18:42   ` Daniel Phillips
2004-07-05 19:08     ` Chris Friesen
2004-07-05 20:29       ` Daniel Phillips
2004-07-07 22:55         ` Steven Dake
2004-07-08  1:30           ` Daniel Phillips
2004-07-05 19:12     ` Lars Marowsky-Bree
2004-07-05 20:27       ` Daniel Phillips
2004-07-06  7:34         ` Lars Marowsky-Bree
2004-07-06 21:34           ` Daniel Phillips
2004-07-07 18:16             ` Lars Marowsky-Bree
2004-07-08  1:14               ` Daniel Phillips
2004-07-08  9:10                 ` Lars Marowsky-Bree
2004-07-08 10:53                   ` David Teigland
2004-07-08 14:14                     ` Chris Friesen
2004-07-08 16:06                       ` David Teigland
2004-07-08 18:22                     ` Daniel Phillips
2004-07-08 19:41                       ` Steven Dake
2004-07-10  4:58                         ` David Teigland
2004-07-10  4:58                         ` Daniel Phillips
2004-07-10 17:59                           ` Steven Dake
2004-07-10 20:57                             ` Daniel Phillips
2004-07-10 23:24                               ` Steven Dake
2004-07-11 19:44                                 ` Daniel Phillips
2004-07-11 21:06                                   ` Lars Marowsky-Bree
2004-07-12  6:58                                     ` Arjan van de Ven
2004-07-12 10:05                                       ` Lars Marowsky-Bree
2004-07-12 10:11                                         ` Arjan van de Ven
2004-07-12 10:21                                           ` Lars Marowsky-Bree
2004-07-12 10:28                                             ` Arjan van de Ven
2004-07-12 11:50                                               ` Lars Marowsky-Bree
2004-07-12 12:01                                                 ` Arjan van de Ven
2004-07-12 13:13                                                   ` Lars Marowsky-Bree
2004-07-12 13:40                                                     ` Nick Piggin
2004-07-12 20:54                                                       ` Andrew Morton
2004-07-13  2:19                                                         ` Daniel Phillips
2004-07-13  2:31                                                           ` Nick Piggin
2004-07-27  3:31                                                             ` Daniel Phillips
2004-07-27  4:07                                                               ` Nick Piggin
2004-07-27  5:57                                                                 ` Daniel Phillips
2004-07-14 12:19                                                         ` Pavel Machek
2004-07-15  2:19                                                           ` Nick Piggin
2004-07-15 12:03                                                             ` Marcelo Tosatti
2004-07-14  8:32                                             ` Pavel Machek
2004-07-12  4:08                                   ` Steven Dake
2004-07-12  4:23                                     ` Daniel Phillips
2004-07-12 18:21                                       ` Steven Dake
2004-07-12 19:54                                         ` Daniel Phillips
2004-07-13 20:06                                         ` Pavel Machek
2004-07-12 10:14                     ` Lars Marowsky-Bree
     [not found] <fa.io9lp90.1c02foo@ifi.uio.no>
     [not found] ` <fa.go9f063.1i72joh@ifi.uio.no>
2004-07-06  6:39   ` Aneesh Kumar K.V
  -- strict thread matches above, loose matches on Subject: below --
2004-07-10 14:58 James Bottomley
2004-07-10 16:04 ` David Teigland
2004-07-10 16:26   ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox