* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-08-09 15:57 ` Hubertus Franke
@ 2004-08-10 11:31 ` Paul Jackson
2004-08-10 22:38 ` Shailabh Nagar
0 siblings, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2004-08-10 11:31 UTC (permalink / raw)
To: Hubertus Franke
Cc: nagar, efocht, mbligh, lse-tech, akpm, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr, ak,
sivanich, ckrm-tech
I've been puzzling over the relationship of cpusets and CKRM the last
few days, unable to understand how they relate, or how either could
make much use of the other.
Others have noticed they both have a hierarchy, and are both concerned
with managing resources in some sense. Hence more than one person has
suspected opportunities for closer integration of the two projects,
indeed, hoped for such opportunities, given that neither code base
has a reputation for being small. Though, to be fair to CKRM, they
have substantially more code invested. Outside of the cpusets.txt file
in Documentation, the cpuset patch is under 2000 lines involving 13
files, whereas a quick count of the June 2004 e13 ckrm and related
cpu patches shows over 15,000 lines involving 62 files.
Someone has suggested that we shouldn't accept the particular names
and directory structure of cpusets into the kernel until we understand
how this interacts with CKRM, because things like this are hard to
change once put in use, and CKRM might impose or at least recommend
different names or such.
The more I look, the more convinced I become that these two projects
are separate, in means and goals, with little interaction and less
opportunity for either to leverage the other. Neither project should
be contingent on the other.
Warning:
No one should take anything that follows as actually
describing CKRM. I can find statements on the CKRM web
pages directly contradicting what I state, and I am certain
that I'm somewhat to substantially confused. I'll just go
ahead and boldly describe CKRM as I currently understand it,
in the hopes that someone knowledgeable in the project will
thus more easily see my errors and offer corrections.
Here is my current understanding of cpusets and CKRM, and how they
differ.
Cpusets - Static Isolation:
The essential purpose of cpusets is to support isolating large,
long-running, multinode compute bound HPC (high performance
computing) applications or relatively independent service jobs,
on dedicated sets of processor and memory nodes.
The (unobtainable) ideal of cpusets is to provide perfect
isolation, for such jobs as:
1) Massive compute jobs that might run hours or days, on dozens
or hundreds of processors, consuming gigabytes or terabytes
of main memory. These jobs are often highly parallel, and
carefully sized and placed to obtain maximum performance
on NUMA hardware, where memory placement and bandwidth is
critical.
2) Independent services for which dedicated compute resources
have been purchased or allocated, in units of one or more
CPUs and Memory Nodes, such as a web server and a DBMS
sharing a large system, but staying out of each other's way.
The essential new construct of cpusets is the set of dedicated
compute resources - some processors and memory. These sets have
names, permissions, an exclusion property, and can be subdivided
into subsets.
The cpuset file system models a hierarchy of 'virtual computers',
a hierarchy that will be deeper on larger systems.
The average lifespan of a cpuset used for (1) above is probably
between hours and days, based on the job lifespan, though a couple
of system cpusets will remain in place as long as the system is
running. The cpusets in (2) above might have a longer lifespan;
you'd have to ask Simon Derr of Bull about that.
CKRM - Dynamic Sharing:
My current, probably confused, understanding is that the purpose
of CKRM is to enable managing different Qualities of Service, or
"Classes" (*) on streams of transactions, queries, jobs, tasks that
are sharing the same compute resources. Even if there is some
big honking service process such as an enterprise DBMS running,
the point of CKRM is not focused on optimizing the overall
performance of that job, but rather on distinguishing between
various transactions flowing through the system, determining the
quality of service (Class) allowed for each, measuring critical
resource usage for each Class, and biasing resource allocation
decisions, such as in the scheduler and allocator, to obtain the
desired balance of resource usage between Classes, or the desired
response time to particular favored Classes.
This is certainly a more challenging objective than cpusets,
in that it requires (1) tracking resource usage (cpu cycles,
memory pages, i/o bandwidth) by Class, (2) assigning a Class to
transactions moving through the system, and imputing that Class to
the tasks handling each transaction, and (3) dynamically biasing
scheduling and allocation decisions so as to affect the desired
Quality of Service policies.
The essential new construct of CKRM is the Class - a Quality
of Service level. Metrics, transactions, tasks, and resource
decisions all have to be tracked or managed by Class.
These Classes form a fairly shallow hierarchy of usage levels or
service qualities, as perceived by the end users of the system.
I'd guess that the average lifetime of a Class is months or years,
as they can reflect the relative priority of relations with long
standing, external customers.
Cpusets and CKRM have profoundly different purposes, economics and
motivations.
For one thing, the cpuset hierarchy and the class hierarchy are two
different things. One provides semi-static collections of compute
resources, which I sometimes call virtual computers or soft partitions.
The other reflects the differing qualities of service which you find
it worth providing the originators of transactions into your system.
These have about as much to do with each other as the "Program Files"
on my son's game machine has to do with Linus' home directory. Yup -
they're both representable in file system trees ;).
I see no value other than obfuscation to attempting to represent
either hierarchy in terms of the other.
One of the valuable parts of my cpuset proposal is that the cpuset
file system reflects the allocation of cpu and memory nodes to
cpusets in a visible and obvious fashion, and thanks to the Linux
vfs infrastructure, provides the customary file system hierarchy and
permission model with little additional cpuset code. Cpusets have
user (administrator) provided pathnames, in a file system hierarchy,
with the usual and expected vfs support. And the filenames (mems,
cpus, tasks, ...) within each cpuset directory have a relevance that
should be preserved. I don't see any value that the CKRM hierarchy
mechanisms, naming or semantics bring to that.
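As a concrete illustration of that vfs-based interface, here is a sketch in Python of how a tool might drive it. The mount point (/dev/cpuset) and the per-cpuset files (cpus, mems, tasks) are taken from the description above; the exact names and semantics are assumptions from this thread, not a tested API.

```python
# Sketch of driving the cpuset file system described above. The mount
# point and file names are assumptions based on this description.
import os

CPUSET_ROOT = "/dev/cpuset"  # assumed mount point

def make_cpuset(name, cpus, mems, root=CPUSET_ROOT):
    """Create a child cpuset: a directory plus its resource files."""
    path = os.path.join(root, name)
    os.mkdir(path)  # each cpuset is a directory in the hierarchy
    with open(os.path.join(path, "cpus"), "w") as f:
        f.write(cpus)   # e.g. "0-63": the dedicated processors
    with open(os.path.join(path, "mems"), "w") as f:
        f.write(mems)   # e.g. "0-15": the dedicated memory nodes
    return path

def join_cpuset(path, pid):
    """Bind a task to a cpuset by writing its pid to the tasks file."""
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))
```

The root parameter exists only so the sketch can be exercised outside a real cpuset mount; on a live system the usual file permission model governs who may do this, as described above.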
For another way to put the difference, CKRM is managing "commodity"
resources, such as cycles and bits. One cycle is as good as the
next; it's just a question of who gets how many. On the other hand,
cpusets manage precious named resources - such as an entire block
of 64 CPUs and associated memory on a 256 CPU system. Each such
cpuset is a unique, named, first class, relatively long lasting
entity represented by its own directory in the cpuset file system,
and assigned a specific well known job to execute.
So what interaction or relationship if any do I see between cpusets
and CKRM? Only one at the moment. A major job running within a
long lasting cpuset might well want to make use of CKRM in order to
provide refined Qualities of Service to its clients. This means that
the CKRM instance would need to understand that it's not managing
the entire physical system, but just some cpuset-defined subset.
A few days ago, one of the CKRM gurus encouraged me to look forward
to providing a CKRM controller for cpusets. At the time, I nodded
knowingly at my screen, as if that all made sense.
Now, I've no clue what such a controller would be or do, or why anyone
would want one.
I look forward to having my likely serious confusions over CKRM
corrected. Meanwhile, I remain convinced that cpusets and CKRM are
separate and distinct projects, and that neither should wait for
the other.
I continue to recommend that cpusets be accepted into the 2.6.9 mm
patches, and if that goes well, into Linus' tree.
Thank-you for reading.
(*) The above description of a Class as a Quality of Service
does _not_ match the phrase on http://ckrm.sourceforge.net:
"A class is a group of Linux tasks (processes), ..."
I'm speculating that this phrase is misleading. More
likely, it's just that I'm confused ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-08-10 11:31 ` [ckrm-tech] " Paul Jackson
@ 2004-08-10 22:38 ` Shailabh Nagar
2004-08-11 10:42 ` Erich Focht
2004-08-14 8:51 ` Paul Jackson
0 siblings, 2 replies; 53+ messages in thread
From: Shailabh Nagar @ 2004-08-10 22:38 UTC (permalink / raw)
To: Paul Jackson
Cc: Hubertus Franke, efocht, mbligh, lse-tech, akpm, hch, steiner,
jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr,
ak, sivanich, ckrm-tech
Paul Jackson wrote:
>
> The more I look, the more convinced I become that these two projects
> are separate, in means and goals, with little interaction and less
> opportunity for either to leverage the other. Neither project should
> be contingent on the other.
>
> Warning:
> No one should take anything that follows as actually
> describing CKRM. I can find statements on the CKRM web
> pages directly contradicting what I state, and I am certain
> that I'm somewhat to substantially confused. I'll just go
> ahead and boldly describe CKRM as I currently understand it,
> in the hopes that someone knowledgeable in the project will
> thus more easily see my errors and offer corrections.
>
> Here is my current understanding of cpusets and CKRM, and how they
> differ.
>
> Cpusets - Static Isolation:
>
> The essential purpose of cpusets is to support isolating large,
> long-running, multinode compute bound HPC (high performance
> computing) applications or relatively independent service jobs,
> on dedicated sets of processor and memory nodes.
CKRM's overall objective is to isolate the performance of a group of
kernel objects from other groups. The grouping can be static
(applications, users, etc.) or dynamic (processes of the same app can
change membership from one group to another).
The group of objects is what we call a class.
The apparent dichotomy between what you describe and what we manage is
resolved when you consider that all applications/users etc. finally boil
down to some set of tasks making resource demands on cpu, mem, io, etc.
Basically we have a flexible way of defining a group of tasks - what
that group maps to in user space doesn't matter inside the kernel when
resource allocations are being done.
>
> The (unobtainable) ideal of cpusets is to provide perfect
> isolation, for such jobs as:
>
> 1) Massive compute jobs that might run hours or days, on dozens
> or hundreds of processors, consuming gigabytes or terabytes
> of main memory. These jobs are often highly parallel, and
> carefully sized and placed to obtain maximum performance
> on NUMA hardware, where memory placement and bandwidth is
> critical.
>
> 2) Independent services for which dedicated compute resources
> have been purchased or allocated, in units of one or more
> CPUs and Memory Nodes, such as a web server and a DBMS
> sharing a large system, but staying out of each other's way.
>
> The essential new construct of cpusets is the set of dedicated
> compute resources - some processors and memory. These sets have
> names, permissions, an exclusion property, and can be subdivided
> into subsets.
The only difference between CKRM and cpusets in the paragraphs above is
that cpusets tries to achieve the isolation by a static partitioning of
physical cpus and mem nodes. CKRM does so in terms of cpu time and
memory pages.
>
> The cpuset file system models a hierarchy of 'virtual computers',
> which hierarchy will be deeper on larger systems.
>
> The average lifespan of a cpuset used for (1) above is probably
> between hours and days, based on the job lifespan, though a couple
> of system cpusets will remain in place as long as the system is
> running. The cpusets in (2) above might have a longer lifespan;
> you'd have to ask Simon Derr of Bull about that.
CKRM class lifespans depend on how the classes are defined by the
sysadmin or delegated users. Classes representing users will last as
long as the system is up; those representing a particular application
will last as long as the app (typically - CKRM doesn't autodelete
classes; the user who created a class needs to delete it).
>
> CKRM - Dynamic Sharing:
>
> My current, probably confused, understanding is that the purpose
> of CKRM is to enable managing different Qualities of Service, or
> "Classes" (*) on streams of transactions, queries, jobs, tasks that
> are sharing the same compute resources.
It would be easier to think of classes as a grouping of tasks and
sockets which, as an aggregate, have some share of each resource managed
by CKRM. A class is not characterized by the QoS level, but the objects
it groups. In particular, two classes can have the same QoS level (e.g.
20% of total cpu time) and the same class can have its QoS level changed
(from 20% to say 40%).
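The distinction can be made concrete with a toy sketch (the class names, pids and numbers are hypothetical, purely for illustration): the class is the grouping; the share is just a mutable attribute of it.

```python
# Toy illustration: a class is defined by its members, not its share.
# Two classes can carry the same share, and a class's share can change
# while its membership stays the same.

class CkrmClass:
    def __init__(self, name, share):
        self.name = name
        self.members = set()   # task pids grouped into this class
        self.share = share     # QoS attribute, not the identity

batch = CkrmClass("batch", 20)
interactive = CkrmClass("interactive", 20)   # same share, distinct class
batch.members.update({101, 102})

batch.share = 40   # QoS level changed; the grouping is untouched
assert batch.members == {101, 102}
assert batch is not interactive
```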
> Even if there is some
> big honking service process such as an enterprise DBMS running,
> the point of CKRM is not focused on optimizing the overall
> performance of that job, but rather on distinguishing between
> various transactions flowing through the system, determining the
> quality of service (Class) allowed for each, measuring critical
> resource usage for each Class, and biasing resource allocation
> decisions, such as in the scheduler and allocator, to obtain the
> desired balance of resource usage between Classes, or the desired
> response time to particular favored Classes.
Managing the QoS of transactions (which tend to cross task/application
boundaries) is a complicated use of CKRM which tries to exploit its
support for flexible and dynamic grouping. Doing this requires some
degree of application cooperation (it is only the app which can tell
what transaction it is processing).
However, transaction QoS management is not what CKRM, the kernel project,
is doing. Its most commonly expected usage is to isolate the performance
of one application from another, or one user from another. Doing this is
far easier than managing transactions, since apps and users map to
tasks/sockets in easily understood ways that do not require any
cooperation from the app/user (indeed we don't want any "cooperation"
or "interference" from them!)
>
> This is certainly a more challenging objective than cpusets,
> in that it requires (1) tracking resource usage (cpu cycles,
> memory pages, i/o bandwidth) by Class, (2) assigning a Class to
> transactions moving through the system, and imputing that Class to
> the tasks handling each transaction, and (3) dynamically biasing
> scheduling and allocation decisions so as to affect the desired
> Quality of Service policies.
Correct, CKRM does have more work to do than cpusets, since it controls
finer-grained resources (cpu time vs. cpus, mem pages vs. nodes).
However, it does get a lot of help from the system and does not have to
carry the burden of 1) and 3) all by itself. 1) only requires existing
resource usage data (cpu time consumed by a process) to be aggregated,
additionally, into class statistics. 3) too can be done as an increment
over existing schedulers, not a replacement. In the case of the CPU, it
means picking the next class to run and then choosing the next task to
run. For memory, it means preferentially picking pages from an "over
share" class to swap out, etc.
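That two-level CPU decision can be sketched in a few lines. This is a toy model under an assumed "most under-served class first" policy; the class names, shares, and the policy itself are illustrative assumptions, not CKRM's actual algorithm.

```python
# Toy two-level scheduler: level 1 picks the runnable class furthest
# below its share of the ticks granted so far; level 2 round-robins
# among the tasks within that class.
from collections import deque

class SchedClass:
    def __init__(self, name, share):
        self.name = name            # e.g. "web"
        self.share = share          # entitled fraction of cpu time
        self.used = 0               # ticks consumed so far
        self.runqueue = deque()     # runnable tasks in this class

def pick_next(classes, total_ticks):
    runnable = [c for c in classes if c.runqueue]
    # Level 1: the class most behind its entitlement goes first.
    cls = min(runnable, key=lambda c: c.used - c.share * total_ticks)
    # Level 2: round-robin among the class's tasks.
    task = cls.runqueue.popleft()
    cls.runqueue.append(task)
    cls.used += 1
    return cls.name, task

web = SchedClass("web", 0.25); web.runqueue.extend(["httpd-1", "httpd-2"])
db = SchedClass("db", 0.75); db.runqueue.append("dbms")
counts = {"web": 0, "db": 0}
for tick in range(8):
    name, _ = pick_next([web, db], tick + 1)
    counts[name] += 1
# With shares 0.25 : 0.75 over 8 ticks, web gets 2 and db gets 6.
```

The point is that the per-class step sits on top of, rather than replaces, the existing within-class scheduling, which is the "increment over existing schedulers" described above.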
>
> The essential new construct of CKRM is the Class - a Quality
> of Service level.
As noted above, this is not the right way to think of a class. Think
groupings! The Quality of Service level is an attribute of a class,
not its defining characteristic.
> Metrics, transactions, tasks, and resource
> decisions all have to be tracked or managed by Class.
>
> These Classes form a fairly shallow hierarchy of usage levels or
> service qualities, as perceived by the end users of the system.
>
> I'd guess that the average lifetime of a Class is months or years,
> as they can reflect the relative priority of relations with long
> standing, external customers.
>
> Cpusets and CKRM have profoundly different purposes, economics and
> motivations.
I would say the methods differ, not the purpose. Both are trying to
performance-isolate groups of tasks - one uses the spatial dimension of
cpu bindings, the other uses the temporal dimension of cpu time.
>
> For one thing, the cpuset hierarchy and the class hierarchy are two
> different things. One provides semi-static collections of compute
> resources, which I sometimes call virtual computers or soft partitions.
> The other reflects the differing qualities of service which you find
> it worth providing the originators of transactions into your system.
> These have about as much to do with each other as the "Program Files"
> on my sons game machine has to do with Linus' home directory. Yup -
> they're both representable in file system trees ;).
Again, I would disagree. The filesystem hierarchies of cpusets and CKRM
have quite a few things in common:
- directories representing the grouping of tasks
- hierarchical subdivision, i.e. a child can only subdivide what its
parent has. In CKRM, only the % share that a parent gets from the system
is further divisible amongst child classes; in cpusets, that resource
happens to be the set of cpus_allowed.
- delegation of control through file permissions : both allow non-root
users to control their resource allocations.
- binding of tasks to a group by writing pids to a special virtual file
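The "a child can only subdivide what its parent has" invariant can be sketched as one check per project (the function names and numbers here are mine, for illustration only):

```python
# Toy model of the shared invariant. For cpusets the resource is a set
# of cpus; for CKRM it is a percentage share of the parent's allocation.

def check_cpuset(parent_cpus, child_cpus):
    """cpusets: a child's cpus must be a subset of its parent's."""
    return set(child_cpus) <= set(parent_cpus)

def check_shares(parent_share, child_shares):
    """CKRM: children's shares may not exceed the parent's share."""
    return sum(child_shares) <= parent_share

assert check_cpuset(range(0, 8), [0, 1, 2, 3])    # ok: subset
assert not check_cpuset(range(0, 4), [2, 3, 4])   # cpu 4 not in parent
assert check_shares(40, [20, 15])                 # ok: 35 <= 40
assert not check_shares(40, [30, 20])             # 50 > 40
```

Same shape of rule, different resource type - which is exactly the structural similarity between the two hierarchies being argued here.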
>
> I see no value other than obfuscation to attempting to represent
> either hierarchy in terms of the other.
Notwithstanding the similarities between the hierarchies listed above,
the danger of obfuscation is real. The reason is that our interface
within the filesystem - the virtual files, and the attributes within
them that one can read and write - does not map cleanly onto the one
exported by cpusets.
E.g. the notions of stats and of lower and upper bounds for shares,
which CKRM needs, are not relevant to cpusets. On the other hand, we do
allow controller-specific attributes to be represented within some
virtual files, and cpusets could use that.
The other point of difference is the one you'd brought up earlier - the
restrictions on hierarchy creation. CKRM has none (effectively);
cpusets has many.
As CKRM's interface stands today, there are sufficient differences
between the interfaces to keep them separate.
However, if CKRM moves to a model where
- each controller is allowed to define its own virtual files and attributes
- each controller has its own hierarchy (and hence more control over
how it can be formed),
then the similarities will be too many to ignore merger possibilities
altogether.
The kicker is, we've not decided. The splitting of controllers into
their own hierarchy is something we're considering independently (as a
consequence of Linus' suggestion at KS04). But making the interface
completely per-controller is something we can do, without too much
effort, IF there is sufficient reason (we have other reasons for doing
that as well - see recent postings on ckrm-tech).
Interest/recommendations from the community that cpusets be part of
CKRM's hierarchy would certainly be a factor in that decision.
>
> For another way to put the difference, CKRM is managing "commodity"
> resources, such as cycles and bits. One cycle is as good as the
> next; it's just a question of who gets how many. On the other hand,
> cpusets manage precious named resources - such as an entire block
> of 64 CPUs and associated memory on a 256 CPU system.
> Each such
> cpuset is a unique, named, first class, relatively long lasting
> entity represented by its own directory in the cpuset file system,
> and assigned a specific well known job to execute.
s/cpuset/class/ and s/cpuset file system/rcfs/, and this pretty much
describes CKRM.
>
> So what interaction or relationship if any do I see between cpusets
> and CKRM? Only one at the moment. A major job running within a
> long lasting cpuset might well want to make use of CKRM in order to
> provide refined Qualities of Service to its clients. This means that
> the CKRM instance would need to understand that it's not managing
> the entire physical system, but just some cpuset-defined subset.
This brings up a very important point. If CKRM's cpu controller is
managing cpu time and cpusets are also operational, it might be hard for
one or the other to achieve their objectives since they're both trying
to constrain CPU usage along different dimensions.
But in a sense, CKRM already faces this problem, since cpu, mem and io
are not completely independent resources. We're pretty much relying on
the sysadmin/user not to set wildly conflicting sets of shares for these
resources, and can have the same expectation of someone trying to use
both the CKRM cpu controller and cpusets at the same time.
>
> A few days ago, one of the CKRM gurus encouraged me to look forward
> to providing a CKRM controller for cpusets. At the time, I nodded
> knowingly at my screen, as if that all made sense.
>
> Now, I've no clue what such a controller would be or do, or why anyone
> would want one.
Such a controller would be a different packaging of the cpusets patch
with most of its internals remaining the same but using the CKRM
interfaces, as Erich had pointed out.
> I look forward to having my likely serious confusions over CKRM
> corrected. Meanwhile, I remain convinced that cpusets and CKRM are
> separate and distinct projects, and that neither should wait for
> the other.
On the non-technical front, this is desirable. Tying two projects
together always runs the risk that one drags the other down. CKRM also
faces this dilemma while considering a switch from using relayfs to
netlink as the kernel-user communication channel. We think relayfs suits
our needs better, but given the problems that project has, we can't
afford to tie ourselves down to it.
Broadly, CKRM is not just a collection of controllers which operate on
arbitrary groups of kernel objects, but also a framework for such
controllers. In the latter role, it has a place for cpusets.
However, cpusets has little need for CKRM except for the commonalities
in the interfaces and that too, if and when CKRM adopts the changes
needed by cpusets.
So the bottom line, IMHO, is the interface - should there be one or two?
One can argue either way. There are already so many filesystems; what's
one more? CKRM doesn't encompass other "grouping" resource controllers
such as outbound network (yet!), so why try to shoehorn cpusets into it?
On the other hand, the user may appreciate a one-stop shop for similar
kinds of resource management, and would probably benefit from an
integration of interfaces. And there is merit to the argument that
interfaces, once adopted in the mainline, will be hard to change.
Rusty's keynote at OLS2003 advised "work on the interfaces last".
Evidently that advice isn't operative here! Future incompatibility of
interfaces is becoming a blocking factor for acceptance/testing/usage of
the core functionality.
One suggestion is to go ahead with the -mm acceptance of cpusets, so its
functionality has a chance to get feedback, and to address the CKRM
interface integration a couple of months from now, once CKRM's interface
issues get resolved. But do let us know if there is interest in merging
(after this round of clarificatory emails is over), as it will affect
which way we go.
-- Shailabh
>
> I continue to recommend that cpusets be accepted into the 2.6.9 mm
> patches, and if that goes well, into Linus' tree.
>
> Thank-you for reading.
>
> (*) The above description of a Class as a Quality of Service
> does _not_ match the phrase on http://ckrm.sourceforge.net:
> "A class is a group of Linux tasks (processes), ..."
> I'm speculating that this phrase is misleading. More
> likely, it's just that I'm confused ;).
>
>
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-08-10 22:38 ` Shailabh Nagar
@ 2004-08-11 10:42 ` Erich Focht
2004-08-11 14:56 ` Shailabh Nagar
2004-08-14 8:51 ` Paul Jackson
1 sibling, 1 reply; 53+ messages in thread
From: Erich Focht @ 2004-08-11 10:42 UTC (permalink / raw)
To: Shailabh Nagar
Cc: Paul Jackson, Hubertus Franke, mbligh, lse-tech, akpm, hch,
steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch,
Simon.Derr, ak, sivanich, ckrm-tech
On Wednesday 11 August 2004 00:38, Shailabh Nagar wrote:
> > Metrics, transactions, tasks, and resource
> > decisions all have to be tracked or managed by Class.
> >
> > These Classes form a fairly shallow hierarchy of usage levels or
> > service qualities, as perceived by the end users of the system.
> >
> > I'd guess that the average lifetime of a Class is months or years,
> > as they can reflect the relative priority of relations with long
> > standing, external customers.
> >
> > Cpusets and CKRM have profoundly different purposes, economics and
> > motivations.
>
> I would say the methods differ, not the purpose. Both are trying to
> performance-isolate groups of tasks - one uses the spatial dimension of
> cpu bindings, the other uses the temporal dimension of cpu time.
So the purpose is different, too. With your words: spatial versus
temporal separation. They are orthogonal. In physics terms: you need
both to describe the universe and you cannot transform the one into
the other. Both make sense, and they can be combined to give more benefit
(aehm, control).
> The other point of difference is the one you'd brought up earlier - the
> restrictions on hierarchy creation. CKRM has none (effectively),
> cpusets has many.
I don't know exactly how it's implemented, but the restrictions should
not be at hierarchy creation time (i.e. when creating the class
(cpusets) subdirectory). They should be imposed when setting/changing
the attributes. Writing illegal values to the virtual attribute files
must simply fail. And each resource controller knows best what it
allows and what it doesn't; this shouldn't be a task of the
infrastructure (CKRM).
> As CKRM's interface stands today, there are sufficient differences
> between the interfaces to keep them separate.
>
> However, if CKRM moves to a model where
> - each controller is allowed to define its own virtual files and attributes
> - each controllers has its own hierarchy (and hence more control over
> how it can be formed),
> then the similarities will be too many to ignore merger possibilities
> altogether.
>
> The kicker is, we've not decided. The splitting of controllers into
> their own hierarchy is something we're considering independently (as a
> consequence of Linus' suggestion at KS04). But making the interface
> completely per-controller is something we can do, without too much
> effort, IF there is sufficient reason (we have other reasons for doing
> that as well - see recent postings on ckrm-tech).
Having controller specifics less hidden is good, because usage becomes
more intuitive and you don't have to RTFM (controller-specific manuals
would have to be written, too). One file per attribute is also nicer
than several attributes hidden in a shares file. Adding an attribute
means adding a file; it doesn't break the old interface, so this is
easier to maintain. And, as you mentioned, some files in the current
CKRM interface just don't make sense for some resources. But CKRM
should provide a sane ruleset for external controllers. For example,
something like:
- Class members are added by writing to the virtual file "target".
- Class members are listed by reading the virtual file "target" and
the format is ...
- Each class attribute should be controlled by one file named
appropriately. Etc...
- Members of a class can register a callback which will be invoked
when following events occur:
- the class is destroyed
- ... ?
- etc ...
> Interest/recommendations from the community that cpusets be part of
> CKRM's hierarchy would certainly be a factor in that decision.
I'd prefer a single entry point for resource management, with
consistent (not necessarily identical) and easy-to-use user interfaces
for all resources.
Regards,
Erich
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-08-11 10:42 ` Erich Focht
@ 2004-08-11 14:56 ` Shailabh Nagar
0 siblings, 0 replies; 53+ messages in thread
From: Shailabh Nagar @ 2004-08-11 14:56 UTC (permalink / raw)
Cc: lse-tech, hch, steiner, jbarnes, sylvain.jeaugey, djh,
linux-kernel, colpatch, Simon.Derr, ak, sivanich, ckrm-tech
Erich Focht wrote:
> On Wednesday 11 August 2004 00:38, Shailabh Nagar wrote:
>
>>> Metrics, transactions, tasks, and resource
>>> decisions all have to be tracked or managed by Class.
>>>
>>> These Classes form a fairly shallow hierarchy of usage levels or
>>> service qualities, as perceived by the end users of the system.
>>>
>>> I'd guess that the average lifetime of a Class is months or years,
>>> as they can reflect the relative priority of relations with long
>>> standing, external customers.
>>>
>>>Cpusets and CKRM have profoundly different purposes, economics and
>>>motivations.
>>
>>I would say the methods differ, not the purpose. Both are trying to
>>performance-isolate groups of tasks - one uses the spatial dimension of
>>cpu bindings, the other uses the temporal dimension of cpu time.
>
>
> So the purpose is different, too. With your words: spatial versus
> temporal separation. They are orthogonal.
By purpose, I meant "performance isolation". The method used is spatial
vs. temporal. But I guess that's just quibbling over words. The
approaches are certainly orthogonal.
Also, cpusets have a purpose beyond isolation, and that is
optimization. One might want to restrict tasks/apps to a NUMA node to
reduce avg mem latency - this is completely beyond CKRM's scope.
> In physics terms: you need
> both to describe the universe and you cannot transform the one into
> the other. Both make sense, they can be combined to give more benefit
> (aehm, control).
On machines with a fairly large number of cpus, this is true. Cpusets
would partition a machine and CKRM would operate within each partition.
But it's less clear whether both the CKRM and cpuset approaches can be
simultaneously used, profitably, on a smaller SMP if one is primarily
interested in isolation.
Partitioning the cpus with cpusets does offer harder guarantees,
replicable isolation, etc., but also runs the risk of underutilization.
If the user primarily wants to give 20% to one app and 40% to another,
he does have to make that call: go with cpusets, which offer better
guarantees but could waste cpus, or create CKRM classes, which also
offer this functionality but risk weaker control depending on the
load from other applications.
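The trade-off can be put in toy numbers (the machine size and shares below are hypothetical, chosen only to mirror the 20%/40% example above):

```python
# Toy arithmetic for the cpusets-vs-shares choice on a hypothetical
# 10-cpu machine with two apps entitled to 20% and 40%.
NCPUS = 10
shares = {"app_a": 0.20, "app_b": 0.40}

# cpusets: entitlements become dedicated whole cpus.
dedicated = {app: round(s * NCPUS) for app, s in shares.items()}
assert dedicated == {"app_a": 2, "app_b": 4}

# If app_a goes idle, its dedicated cpus are stranded under cpusets:
stranded = dedicated["app_a"]   # 2 cpus doing nothing
# Under CKRM soft shares, those same 2 cpus' worth of time could be
# borrowed by other classes - at the price of weaker guarantees when
# app_a becomes busy again.
borrowable = stranded
```

This is the underutilization-vs-guarantees tension in its simplest form; hard limits, discussed next, change the balance.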
To further complicate that choice, CKRM's design does provide for
implementation of hard vs. soft limits where hard limits would provide
the stronger guarantees that a user might want.
The CKRM CPU controller, in particular, is close (~ two weeks to
availability) to providing an implementation of hard limits, which would
offer stronger guarantees along the temporal dimension.
>
>
>
>>The other point of difference is the one you'd brought up earlier - the
>>restrictions on hierarchy creation. CKRM has none (effectively),
>>cpusets has many.
>
>
> Don't know how it's exactly implemented, but the restrictions should
> not be at hierarchy creation time (i.e. when creating the class
> (cpusets) subdirectory). They should be imposed when setting/changing
> the attributes.
True - I was lumping the "create cpuset + set its cpu ownership
values" steps into hierarchy creation. But the point still holds:
CKRM has no controller-defined restrictions on changing attributes;
cpusets does.
> Writing illegal values to the virtual attribute files
> must simply fail. And each resource controller knows best what it
> allows for and what not, this shouldn't be a task of the
> infrastructure (CKRM).
Yes, this makes sense.
>>As CKRM's interface stands today, there are sufficient differences
>>between the interfaces to keep them separate.
>>
>>However, if CKRM moves to a model where
>>- each controller is allowed to define its own virtual files and attributes
>>- each controllers has its own hierarchy (and hence more control over
>>how it can be formed),
>>then the similarities will be too many to ignore merger possibilities
>>altogether.
>>
>>The kicker is, we've not decided. The splitting of controllers into
>>their own hierarchy is something we're considering independently (as a
>>consequence of Linus' suggestion at KS04). But making the interface
>>completely per-controller is something we can do, without too much
>>effort, IF there is sufficient reason (we have other reasons for doing
>>that as well - see recent postings on ckrm-tech).
>
>
> Having controller specifics less hidden is good because usage becomes
> more intuitive and you don't have to RTFM (controller specific manuals
> would have to be written, too). One file per attribute is also nicer
> than several attributes hidden in a shares files. Adding an attribute
> means adding a file, it doesn't break the old interface, so this is
> easier to maintain. And, as you mentioned, some files in the current
> CKRM interface just don't make sense for some resources. But a sane
> ruleset provided by CKRM for external controllers should be
> there. For example something like:
> - Class members are added by writing to the virtual file "target".
> - Class members are listed by reading the virtual file "target" and
> the format is ...
> - Each class attribute should be controlled by one file named
> appropriately. Etc...
> - Members of a class can register a callback which will be invoked
> when following events occur:
> - the class is destroyed
> - ... ?
> - etc ...
One file per attribute is an excellent idea and the slight additional
overhead won't matter since attribute changes are rarely in the
critical path. Will follow up on this on ckrm-tech (which is cc'ed).
We'll still need to keep statistics grouped as far as possible because
the overhead of reading several files vs. one will matter.
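To make the idea concrete, a per-attribute layout for one class of a hypothetical cpu controller might look like the sketch below. The /rcfs/taskclass prefix follows CKRM's existing rcfs naming; the class name and the individual attribute files underneath are invented for illustration, and grouped statistics stay in one file as noted above.

```
/rcfs/taskclass/webserver/
    target      # write a pid to add a member; read it to list members
    cpu_share   # one attribute, one file
    cpu_limit   # adding an attribute later just adds a file
    stats       # statistics remain grouped in a single file
```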
>
>
>>Interest/recommendations from the community that cpusets be part of
>>CKRM's hierarchy would certainly be a factor in that decision.
>
>
> I'd prefer a single entry point for resource management with
> consistent (not necessarily the same) and easy to use user interfaces
> for all resources.
>
> Regards,
> Erich
>
P.S. I've pruned some of the names on the cc: list who are obviously
subscribed to one list or the other (mailman on sf keeps complaining
if the cc list is too long). I can be dropped from the cc: too if this
thread continues...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-08-10 22:38 ` Shailabh Nagar
2004-08-11 10:42 ` Erich Focht
@ 2004-08-14 8:51 ` Paul Jackson
1 sibling, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-08-14 8:51 UTC (permalink / raw)
To: Shailabh Nagar
Cc: frankeh, efocht, mbligh, lse-tech, akpm, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr, ak,
sivanich, ckrm-tech
Shailabh writes:
> But do let us know if there is interest in merging
> (after this round of clarificatory emails is over) as it will affect
> which way we go.
I remain convinced that such a merging would be wrongheaded.
When I examine the experience that other operating systems such as
Solaris, Unicos and Irix have had with resource share groups and
cpusets, they have considered these to be two distinct facilities,
correctly so in my view.
So I recommend that you not try to bend CKRM to include cpusets.
Unless others have more to say, I too am content to close this thread
for now. I've email-bombed enough mail boxes for one week ;).
I'll have an updated cpuset patch, hopefully next week, hopefully
with a shorter Cc list this time, with a couple of modest fixes.
Thank-you.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* RE: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-01 23:41 ` Andrew Morton
2004-10-02 6:06 ` Paul Jackson
@ 2004-10-02 15:46 ` Marc E. Fiuczynski
2004-10-02 16:17 ` Hubertus Franke
2004-10-02 17:47 ` Paul Jackson
1 sibling, 2 replies; 53+ messages in thread
From: Marc E. Fiuczynski @ 2004-10-02 15:46 UTC (permalink / raw)
To: Andrew Morton, Shailabh Nagar, ckrm-tech
Cc: pj, efocht, mbligh, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr, ak,
sivanich, Larry Peterson
Paul & Andrew,
For PlanetLab (www.planet-lab.org) we also care very much about isolation
between different users. Maybe not to the same degree as your users.
Nonetheless, penning in resource hogs is very important to us. We are
giving CKRM a shot. Over the past two weeks I have worked with Hubertus,
Chandra, and Shailabh to iron various bugs. The controllers appear to be
working at first approximation. From our perspective, it is not so much the
specific resource controllers but the CKRM framework that is of importance.
I.e., we certainly plan to test and implement other resource controllers for
CPU, disk I/o and memory isolation.
For cpu isolation, would it suffice to use a HTB-based cpu scheduler. This
is essentially what the XEN folks are using to ensure strong isolation
between separate Xen domains. An implementation of such a scheduler exists
as part of the linux-vserver project and the port of that to CKRM should be
straightforward. In fact, I am thinking of doing such a port for PlanetLab
just to have an alternative to the existing CKRM cpu controller. Seems like
an implementation of that scheduler (or a modification to the existing CKRM
controller) + some support for CPU affinity + hotplug CPU support might
approach your cpuset solution. Correct me if I completely missed it.
For memory isolation, I am not sufficiently familiar with NUMA style
machines to comment on this topic. The CKRM memory controller is
interesting, but we have not used it sufficiently to comment.
Finally, in terms of isolation, we have mixed together CKRM with VSERVERs,
using CKRM for performance isolation and vserver (for lack of a better
name) "view" isolation. Maybe your users care about the vserver style of
isolation. We have an anon cvs server with our kernel (which is based on
Fedora Core 2 1.521 + vserver 1.9.2 + the latest ckrm e16 framework and
resource controllers that are not even available yet at ckrm.sf.net), which
you are welcome to play with.
Best regards,
Marc
-----------
Marc E. Fiuczynski
PlanetLab Consortium --- OS Taskforce PM
Princeton University --- Research Scholar
http://www.cs.princeton.edu/~mef
> -----Original Message-----
> From: ckrm-tech-admin@lists.sourceforge.net
> [mailto:ckrm-tech-admin@lists.sourceforge.net]On Behalf Of Andrew Morton
> Sent: Friday, October 01, 2004 7:41 PM
> To: Shailabh Nagar; ckrm-tech@lists.sourceforge.net
> Cc: pj@sgi.com; efocht@hpce.nec.com; mbligh@aracnet.com;
> lse-tech@lists.sourceforge.net; hch@infradead.org; steiner@sgi.com;
> jbarnes@sgi.com; sylvain.jeaugey@bull.net; djh@sgi.com;
> linux-kernel@vger.kernel.org; colpatch@us.ibm.com; Simon.Derr@bull.net;
> ak@suse.de; sivanich@sgi.com
> Subject: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and
> memory placement
>
>
>
> Paul, I'm having second thoughts regarding a cpusets merge. Having gone
> back and re-read the cpusets-vs-CKRM thread from mid-August, I am quite
> unconvinced that we should proceed with two orthogonal resource
> management/partitioning schemes.
>
> And CKRM is much more general than the cpu/memsets code, and hence it
> should be possible to realize your end-users' requirements using an
> appropriately modified CKRM, and a suitable controller.
>
> I'd view the difficulty of implementing this as a test of the wisdom of
> CKRM's design, actually.
>
> The clearest statement of the end-user cpu and memory partitioning
> requirement is this, from Paul:
>
> > Cpusets - Static Isolation:
> >
> > The essential purpose of cpusets is to support isolating large,
> > long-running, multinode compute bound HPC (high performance
> > computing) applications or relatively independent service jobs,
> > on dedicated sets of processor and memory nodes.
> >
> > The (unobtainable) ideal of cpusets is to provide perfect
> > isolation, for such jobs as:
> >
> > 1) Massive compute jobs that might run hours or days, on dozens
> > or hundreds of processors, consuming gigabytes or terabytes
> > of main memory. These jobs are often highly parallel, and
> > carefully sized and placed to obtain maximum performance
> > on NUMA hardware, where memory placement and bandwidth is
> > critical.
> >
> > 2) Independent services for which dedicated compute resources
> > have been purchased or allocated, in units of one or more
> > CPUs and Memory Nodes, such as a web server and a DBMS
> > sharing a large system, but staying out of each others way.
> >
> > The essential new construct of cpusets is the set of dedicated
> > compute resources - some processors and memory. These sets have
> > names, permissions, an exclusion property, and can be subdivided
> > into subsets.
> >
> > The cpuset file system models a hierarchy of 'virtual computers',
> > which hierarchy will be deeper on larger systems.
> >
> > The average lifespan of a cpuset used for (1) above is probably
> > between hours and days, based on the job lifespan, though a couple
> > of system cpusets will remain in place as long as the system is
> > running. The cpusets in (2) above might have a longer lifespan;
> > you'd have to ask Simon Derr of Bull about that.
> >
>
> Now, even that is not a very good end-user requirement because it does
> prejudge the way in which the requirement's solution should be
> implemented.
> Users don't require that their NUMA machines "model a hierarchy of
> 'virtual computers'". Users require that their NUMA machines implement
> some particular behaviour for their work mix. What is that behaviour?
>
> For example, I am unable to determine from the above whether the users
> would be 90% satisfied with some close-enough ruleset which was
> implemented
> with even the existing CKRM cpu and memory governors.
>
> So anyway, I want to reopen this discussion, and throw a huge spanner in
> your works, sorry.
>
> I would ask the CKRM team to tell us whether there has been any
> progress in
> this area, whether they feel that they have a good understanding
> of the end
> user requirement, and to sketch out a design with which CKRM could satisfy
> that requirement.
>
> Thanks.
>
>
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 15:46 ` Marc E. Fiuczynski
@ 2004-10-02 16:17 ` Hubertus Franke
2004-10-02 17:53 ` Paul Jackson
2004-10-02 20:40 ` Andrew Morton
2004-10-02 17:47 ` Paul Jackson
1 sibling, 2 replies; 53+ messages in thread
From: Hubertus Franke @ 2004-10-02 16:17 UTC (permalink / raw)
To: Marc E. Fiuczynski
Cc: Andrew Morton, Shailabh Nagar, ckrm-tech, pj, efocht, mbligh,
lse-tech, hch, steiner, jbarnes, sylvain.jeaugey, djh,
linux-kernel, colpatch, Simon.Derr, ak, sivanich, Larry Peterson
Marc E. Fiuczynski wrote:
> Paul & Andrew,
>
> For PlanetLab (www.planet-lab.org) we also care very much about isolation
> between different users. Maybe not to the same degree as your users.
> Nonetheless, penning in resource hogs is very important to us. We are
> giving CKRM a shot. Over the past two weeks I have worked with Hubertus,
> Chandra, and Shailabh to iron out various bugs. The controllers appear to
> be working to a first approximation. From our perspective, it is not so
> much the specific resource controllers but the CKRM framework that is of
> importance. I.e., we certainly plan to test and implement other resource
> controllers for CPU, disk I/O and memory isolation.
>
> For cpu isolation, would it suffice to use an HTB-based cpu scheduler? This
> is essentially what the XEN folks are using to ensure strong isolation
> between separate Xen domains. An implementation of such a scheduler exists
> as part of the linux-vserver project and the port of that to CKRM should be
> straightforward. In fact, I am thinking of doing such a port for PlanetLab
> just to have an alternative to the existing CKRM cpu controller. Seems like
> an implementation of that scheduler (or a modification to the existing CKRM
> controller) + some support for CPU affinity + hotplug CPU support might
> approach your cpuset solution. Correct me if I completely missed it.
Marc, cpusets lead to physical isolation.
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 15:46 ` Marc E. Fiuczynski
2004-10-02 16:17 ` Hubertus Franke
@ 2004-10-02 17:47 ` Paul Jackson
1 sibling, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-02 17:47 UTC (permalink / raw)
To: Marc E. Fiuczynski
Cc: akpm, nagar, ckrm-tech, efocht, mbligh, lse-tech, hch, steiner,
jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr,
ak, sivanich, llp
Marc writes:
>
> For PlanetLab (www.planet-lab.org) we also care very much about isolation
> between different users. Maybe not to the same degree as your users.
> Nonetheless, penning in resource hogs is very important to us.
Thank-you for your report, Marc.
Before I look at code, I think we could do with a little more
discussion of usage patterns and requirements.
Despite my joke about "1) isolation, 2) isolation, and 3) isolation"
being the most important requirements on cpusets, there are further
requirements presented by typical cpuset users, which I tried to spell
out in my previous post.
Could you do a couple more things to further help this discussion:
1) I know nothing at this moment of what PlanetLab is or what
they do. Could you describe this a bit - your business, your
customers' usage patterns and how these make use of CKRM? Perhaps
a couple of web links will help here. I will also do a Google
search now, in an effort to become more educated on PlanetLab.
I might come away from this thinking one of:
a. Dang - that sounds a lot like what my cpuset users are
doing. If CKRM meets PlanetLab's needs, it might meet
my users' needs too. I should put aside my skepticism
and approach Andrew's proposal to have CKRM supplant
cpusets with a more open mind than (I will confess)
I have now.
b. No, no - that's something different. PlanetLab doesn't
have the particular requirements x, y and z that my cpuset
users do. Rather they have other requirements, a, b and
c, that seem to fit my understanding of CKRM well, but
not cpusets.
2) I made some effort to present the usage patterns and
requirements of cpuset users in my post. Could you read
it and comment on the requirements I presented.
I'd be interested to know, for each cpuset requirement I
presented, which of the following multiple choices applies
in your case:
a. Huh - I (Marc) don't understand what you (pj) are
saying here well enough to comment further.
b. Yes - this sounds just like something PlanetLab needs,
perhaps rephrasing the requirement in terms more familiar
to you. And CKRM meets this requirement this way ...
c. No - this is not a big need PlanetLab has of its resource
management technology (perhaps noting in this case,
whether, in your understanding of CKRM, CKRM addresses
this requirement anyway, even though you don't need it).
I encourage you to stay "down to earth" in this, at least initially.
Speak in terms familiar to you, and present the actual, practical
experience you've gained at PlanetLab.
I want to avoid the trap of premature abstraction:
Gee - both CKRM and cpusets deal with resource management, both
have kernel hooks in the allocators and schedulers, both have
hierarchies and both provide isolation of some sort. They must
be two solutions to the same problem (or at least, since CKRM
is obviously bigger, it must be a solution to a superset of
the problems that cpusets addresses), and so we should pick one
(the superset, no doubt) and drop the other to avoid duplication.
Let us begin this discussion with a solid grounding in the actual
experiences we bring to this thread.
Thank-you.
"I'm thinking of a 4 legged, long tailed, warm blooded
creature, commonly associated with milk, that makes a
sound written in my language starting with the letter 'M'.
The name of the animal is a three letter word starting
with the letter 'C'. We had many of them in the barn on
my Dad's dairy farm."
Mooo ? [cow]
No - meow. [cat]
And no, we shouldn't try to catch mice with cows, even
if they are bigger than cats.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 16:17 ` Hubertus Franke
@ 2004-10-02 17:53 ` Paul Jackson
2004-10-02 18:16 ` Hubertus Franke
2004-10-02 20:40 ` Andrew Morton
1 sibling, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2004-10-02 17:53 UTC (permalink / raw)
To: Hubertus Franke
Cc: mef, akpm, nagar, ckrm-tech, efocht, mbligh, lse-tech, hch,
steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch,
Simon.Derr, ak, sivanich, llp
Hubertus wrote:
>
> Marc, cpusets lead to physical isolation.
This is slightly too terse for my dense brain to grok.
Could you elaborate just a little, Hubertus? Thanks.
(Try to quote less - I almost missed your reply in
the middle of all the quoted stuff.)
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 17:53 ` Paul Jackson
@ 2004-10-02 18:16 ` Hubertus Franke
2004-10-02 19:14 ` Paul Jackson
2004-10-02 23:29 ` Peter Williams
0 siblings, 2 replies; 53+ messages in thread
From: Hubertus Franke @ 2004-10-02 18:16 UTC (permalink / raw)
To: Paul Jackson
Cc: akpm, ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr, ak,
sivanich, llp
Paul Jackson wrote:
> Hubertus wrote:
>
>>Marc, cpusets lead to physical isolation.
>
>
> This is slightly too terse for my dense brain to grok.
> Could you elaborate just a little, Hubertus? Thanks.
>
A minimal quote from your website :-)
"CpuMemSets provides a new Linux kernel facility that enables system
services and applications to specify on which CPUs they may be
scheduled, and from which nodes they may allocate memory."
Since I have addressed the cpu section, it seems obvious that
in order to ISOLATE different workloads, you associate them with
non-overlapping cpusets; technically they are then physically isolated
from each other on the chosen CPUs.
Given that cpuset hierarchies translate into cpu-affinity masks,
this desired isolation can result in lost cycles globally.
I believe this to be orthogonal to share settings. To me both
are extremely desirable features.
I also pointed out that if you separate mechanism from API, it
is possible to move the CPU set API under the CKRM framework.
I have not thought about the memory aspect.
-- Hubertus
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 18:16 ` Hubertus Franke
@ 2004-10-02 19:14 ` Paul Jackson
2004-10-02 23:29 ` Peter Williams
1 sibling, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-02 19:14 UTC (permalink / raw)
To: Hubertus Franke
Cc: akpm, ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr, ak,
sivanich, llp
Hubertus wrote:
>
> A minimal quote from your website :-)
Ok - now I see what you're saying.
Let me expound a bit on this line, from a different perspective.
While big NUMA boxes provide the largest single system image
systems available currently, they have their complications. The bus and
cache structures and geometry are complex and multilayered.
For more modest, more homogeneous systems, one can benefit from putting
CKRM controllers (I hope I'm using this term correctly here) on things
like memory pages, cpu cycles, disk i/o, and network i/o in order to
provide a fairly rich degree of control over what share of resources
each application class receives, and obtain both efficient and
controlled balance of resource usage.
But for the big NUMA configuration, running some of our customers most
performance critical applications, one cannot achieve the desired
performance by trying to control all the layers of cache and bus, in
complex geometries, with their various interactions.
So instead one ends up using an orthogonal (thanks, Hubertus) and
simpler mechanism - physical isolation(*). These nodes, and all their
associated hardware, are dedicated to the sole use of this critical
application. There is still sometimes non-trivial work done, for a
given application, to tune its performance, but by removing (well, at
least radically reducing) the interactions of other unknown applications
on the same hardware resources, the tuning of the critical application
now becomes a practical, solvable task.
In corporate organizations, this resembles the difference between having
separate divisions with their own P&L statements, kept at arms length
for all but a few common corporate services [cpusets], versus the more
dynamic trade-offs made within a single division, moving limited
resources back and forth in order to meet changing and sometimes
conflicting objectives in accordance with the priorities dictated by
upper management [CKRM].
(*) Well, not physical isolation in the sense of unplugging the
interconnect cables. Rather logical isolation of big chunks
of the physical hardware. And not pure 100% isolation, as
would come from running separate kernel images, but minimal
controlled isolation, with the ability to keep out anything
that causes interference if it doesn't need to be there, on
those particular CPUs and Memory Nodes.
And our customers _do_ want to manage these logically isolated
chunks as named "virtual computers" with system managed permissions
and integrity (such as the system-wide attribute of "Exclusive"
ownership of a CPU or Memory by one cpuset, and a robust ability
to list all tasks currently in a cpuset). This is a genuine user
requirement to my understanding, apparently contrary to Andrew's.
The above is not the only use of cpusets - there's also providing
a base for ports of PBS and LSF workload managers (which if I recall
correctly arose from earlier HPC environments similar to the one
I described above), and there's the work being done by Bull and NEC,
which can better be spoken to by representatives of those companies.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 16:17 ` Hubertus Franke
2004-10-02 17:53 ` Paul Jackson
@ 2004-10-02 20:40 ` Andrew Morton
2004-10-02 23:08 ` Hubertus Franke
2004-10-03 2:26 ` Paul Jackson
1 sibling, 2 replies; 53+ messages in thread
From: Andrew Morton @ 2004-10-02 20:40 UTC (permalink / raw)
To: Hubertus Franke
Cc: mef, nagar, ckrm-tech, pj, efocht, mbligh, lse-tech, hch, steiner,
jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr,
ak, sivanich, llp
Hubertus Franke <frankeh@watson.ibm.com> wrote:
>
> Marc, cpusets lead to physical isolation.
Despite what Paul says, his customers *do not* "require" physical isolation
[*]. That's like an accountant requiring that his spreadsheet be written
in Pascal. He needs slapping.
Isolation is merely the means by which cpusets implements some higher-level
customer requirement.
I want to see a clearer description of what that higher-level requirement is.
Then I'd like to see some thought put into whether CKRM (with probably a new
controller) can provide a good-enough implementation of that requirement.
Coming at this from the other direction: CKRM is being positioned as a
general purpose resource management framework, yes? Isolation is a simple
form of resource management. If the CKRM framework simply cannot provide
this form of isolation then it just failed its first test, did it not?
[*] Except for the case where there is graphics (or other) hardware close
to a particular node. In that case it is obvious that CPU-group pinning is
the only way in which to satisfy the top-level requirement of "make access
to the graphics hardware be efficient".
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 23:08 ` Hubertus Franke
@ 2004-10-02 22:26 ` Alan Cox
2004-10-03 2:49 ` Paul Jackson
2004-10-03 3:25 ` Paul Jackson
2 siblings, 0 replies; 53+ messages in thread
From: Alan Cox @ 2004-10-02 22:26 UTC (permalink / raw)
To: Hubertus Franke
Cc: Andrew Morton, mef, nagar, ckrm-tech, pj, efocht, mbligh,
lse-tech, hch, steiner, jbarnes, sylvain.jeaugey, djh,
Linux Kernel Mailing List, colpatch, Simon.Derr, ak, sivanich,
llp
On Sul, 2004-10-03 at 00:08, Hubertus Franke wrote:
> Andrew Morton wrote:
> > Hubertus Franke <frankeh@watson.ibm.com> wrote:
> >
> >>Marc, cpusets lead to physical isolation.
Not realistically on x86 unless you start billing memory accesses IMHO
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 20:40 ` Andrew Morton
@ 2004-10-02 23:08 ` Hubertus Franke
2004-10-02 22:26 ` Alan Cox
` (2 more replies)
2004-10-03 2:26 ` Paul Jackson
1 sibling, 3 replies; 53+ messages in thread
From: Hubertus Franke @ 2004-10-02 23:08 UTC (permalink / raw)
To: Andrew Morton
Cc: mef, nagar, ckrm-tech, pj, efocht, mbligh, lse-tech, hch, steiner,
jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr,
ak, sivanich, llp
Andrew Morton wrote:
> Hubertus Franke <frankeh@watson.ibm.com> wrote:
>
>>Marc, cpusets lead to physical isolation.
>
>
> Despite what Paul says, his customers *do not* "require" physical isolation
> [*]. That's like an accountant requiring that his spreadsheet be written
> in Pascal. He needs slapping.
>
> Isolation is merely the means by which cpusets implements some higher-level
> customer requirement.
>
> I want to see a clearer description of what that higher-level requirement is.
>
> Then I'd like to see some thought put into whether CKRM (with probably a new
> controller) can provide a good-enough implementation of that requirement.
>
CKRM could do so. We already provide the name space and the class
hierarchy. If a cpuset is associated with a class, then the class
controller can set the appropriate masks in the system.
The issue that Paul correctly pointed out is that if you associate the
current task classes, i.e. set cpu and i/o shares, then one MIGHT have
conflicting directives to the system.
This can be avoided by not utilizing cpu shares at that point, or by
living with the potential share imbalance that will arise from being
forced into the various affinity constraints of the tasks.
But we already have to live with that anyway when resources create
dependencies; for instance, too little memory can potentially impact the
obtained cpu share.
Alternatively, cpumem sets could be introduced as a whole new classtype
that, similar to the socket classtype, will have this one controller
associated.
So to me cpumem sets as as concept is useful, so I won't be doing that
whopping, but it can be integrated into CKRM as classtype/controller
concept. Particularly for NUMA machines it makes sense in the absence of
more sophisticated and (sub)optimal placement by the OS.
> Coming at this from the other direction: CKRM is being positioned as a
> general purpose resource management framework, yes? Isolation is a simple
> form of resource management. If the CKRM framework simply cannot provide
> this form of isolation then it just failed its first test, did it not?
>
That's fair to say. I think it is feasible by utilizing the guts of the
cpumem set and wrapping the CKRM RCFS and class objects around it.
> [*] Except for the case where there is graphics (or other) hardware close
> to a particular node. In that case it is obvious that CPU-group pinning is
> the only way in which to satisfy the top-level requirement of "make access
> to the graphics hardware be efficient".
Yipp ... but it is also useful if one has limited faith in the system
to always do the right thing. If I have no control over where tasks go, I
can potentially end up introducing heavy bus traffic (over the NUMA link).
There's a good reason why, in many HPC deployments, applications try to
bypass the OS ...
Hope this helps.
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 18:16 ` Hubertus Franke
2004-10-02 19:14 ` Paul Jackson
@ 2004-10-02 23:29 ` Peter Williams
2004-10-02 23:51 ` Hubertus Franke
1 sibling, 1 reply; 53+ messages in thread
From: Peter Williams @ 2004-10-02 23:29 UTC (permalink / raw)
To: Hubertus Franke
Cc: Paul Jackson, akpm, ckrm-tech, efocht, lse-tech, hch, steiner,
jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr,
ak, sivanich, llp
Hubertus Franke wrote:
>
>
> Paul Jackson wrote:
>
>> Hubertus wrote:
>>
>>> Marc, cpusets lead to physical isolation.
>>
>>
>>
>> This is slightly too terse for my dense brain to grok.
>> Could you elaborate just a little, Hubertus? Thanks.
>>
>
> A minimal quote from your website :-)
>
> "CpuMemSets provides a new Linux kernel facility that enables system
> services and applications to specify on which CPUs they may be
> scheduled, and from which nodes they may allocate memory."
>
> Since I have addressed the cpu section it seems obvious that
> in order to ISOLATE different workloads, you associate them onto
> non-overlapping cpusets, thus technically they are physically isolated
> from each other on said chosen CPUs.
>
> Given that cpuset hierarchies translate into cpu-affinity masks,
> this desired isolation can result in lost cycles globally.
This argument if followed to its logical conclusion would advocate the
abolition of CPU affinity masks completely.
>
> I believe this to be orthogonal to share settings. To me both
> are extremely desirable features.
>
> I also pointed out that if you separate mechanism from API, it
> is possible to move the CPU set API under the CKRM framework.
> I have not thought about the memory aspect.
>
> -- Hubertus
>
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 23:29 ` Peter Williams
@ 2004-10-02 23:51 ` Hubertus Franke
0 siblings, 0 replies; 53+ messages in thread
From: Hubertus Franke @ 2004-10-02 23:51 UTC (permalink / raw)
To: Peter Williams
Cc: Paul Jackson, akpm, ckrm-tech, efocht, lse-tech, hch, steiner,
jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch, Simon.Derr,
ak, sivanich, llp
Peter Williams wrote:
> Hubertus Franke wrote:
>
>>
>>
>> Paul Jackson wrote:
>> A minimal quote from your website :-)
>>
>> "CpuMemSets provides a new Linux kernel facility that enables system
>> services and applications to specify on which CPUs they may be
>> scheduled, and from which nodes they may allocate memory."
>>
>> Since I have addressed the cpu section it seems obvious that
>> in order to ISOLATE different workloads, you associate them onto
>> non-overlapping cpusets, thus technically they are physically isolated
>> from each other on said chosen CPUs.
>>
>> Given that cpuset hierarchies translate into cpu-affinity masks,
>> this desired isolation can result in lost cycles globally.
>
>
> This argument if followed to its logical conclusion would advocate the
> abolition of CPU affinity masks completely.
>
No, why is that? One can restrict memory on a task and by doing so waste
cycles in paging. That does not mean we should get rid of memory
restrictions or the like.
Losing cycles is simply an observation of what could happen.
As in any system, over-constraining a given workload (wrt affinity,
cpu limits, rate control) can lead to suboptimal utilization of
resources. That does not mean there is no rationale for the constraints
in the first place, and hence it does not mean they should never be
allowed.
Cheers ..
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 20:40 ` Andrew Morton
2004-10-02 23:08 ` Hubertus Franke
@ 2004-10-03 2:26 ` Paul Jackson
2004-10-03 14:11 ` Paul Jackson
1 sibling, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2004-10-03 2:26 UTC (permalink / raw)
To: Andrew Morton
Cc: frankeh, mef, nagar, ckrm-tech, efocht, mbligh, lse-tech, hch,
steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch,
Simon.Derr, ak, sivanich, llp
Andrew writes:
>
> Despite what Paul says, his customers *do not* "require" physical isolation
> [*]. That's like an accountant requiring that his spreadsheet be written
> in Pascal. He needs slapping.
No - it's like an accountant saying the books for your two sole
proprietor Subchapter S corporations have to be kept separate.
Consider the following use case scenario, which emphasizes this
isolation aspect (and ignores other requirements, such as the need for
system admins to manage cpusets by name [some handle valid across
process contexts], with a system wide imposed permission model and
exclusive use guarantees, and with a well defined system supported
notion of which tasks are "in" which cpuset at any point in time).
===
You're running a 64-way, compute bound application on 64 CPUs of your
256 CPU system. The 64 threads are in lock step, tightly coupled, for
three days straight. You've sized the application and the computer you
bought to run that application to within the last few percent of what
CPU cycles are available on 64 CPUs and how many memory pages are
available on the nodes local to those CPUs. It's an MPI application in
Fortran, using most of the available bandwidth between those nodes for
synchronization on each loop of the computation. If a single thread slows
down 10% for any reason, the entire application slows down that much
(sometimes worse), and you have big money on the table ensuring that
doesn't happen. You absolutely positively have to complete that
application run on time, in three days (say it's a weather forecast for
four days out). You've varied the resolution to which you compute the
answer or the size of your input data set or whatever else you could, in
order to obtain the most accurate answer you could, in three days, not
an hour longer. If the runtimes jump around by more than 5% or 10%,
some Vice President starts losing sleep. If it's a 20% variation, that
sleep deprived Vice President works for the computer company that sold
you the system. The boss of the boss of my boss ;).
I now know that every one of these 64 threads is pinned for those three
days. It's just as pinned as the graphics application that has to be
near its hardware. Due to both the latency effects of the several
levels of hardware cache (on the CPU chip and off), and the additional
latency effects imposed by the software when it decides on which node to
place a page of memory off a page fault, nothing can move. Not in, not
out, not within. To within a fraction of a percent, nothing else may be
allowed onto those nodes, nothing of those 64 threads may be allowed off
those nodes, and none of the threads may be allowed to move within the
64 CPUs. And not just any random subset of 64 CPUs selected from the
256 available, but a subset that's "close" together, given the complex
geometries of these big systems (minimum number of router hops between
the furthest apart pair of CPUs in the set of 64 CPUs).
(*) Message Passing Interface (MPI) - http://www.mpi-forum.org
===
It's a requirement, I say. It's a requirement. Let the slapping begin ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 23:08 ` Hubertus Franke
2004-10-02 22:26 ` Alan Cox
@ 2004-10-03 2:49 ` Paul Jackson
2004-10-03 12:19 ` Hubertus Franke
2004-10-03 3:25 ` Paul Jackson
2 siblings, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2004-10-03 2:49 UTC (permalink / raw)
To: Hubertus Franke
Cc: akpm, mef, nagar, ckrm-tech, efocht, mbligh, lse-tech, hch,
steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch,
Simon.Derr, ak, sivanich, llp
Hubertus wrote:
>
> CKRM could do so. We already provide the name space and the class
> hierarchy.
Just because two things have name spaces and hierarchies, doesn't
make them interchangeable. Name spaces and hierarchies are just
implementation mechanisms - many interesting, entirely unrelated,
solutions make use of them.
What are the objects named, and what is the relation underlying
the hierarchy? These must match up.
The objects named in cpusets are subsets of a system's CPUs and Memory
Nodes. The relation underlying the hierarchy is the subset relation on
these sets: if one cpuset node is a descendant of another, then its
CPUs and Memory Nodes are a subset of the other's.
What is the corresponding statement for CKRM?
For CKRM to subsume cpusets, there must be an injective map from the
above cpuset objects to CKRM objects, that preserves this subset
relation on cpusets.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 23:08 ` Hubertus Franke
2004-10-02 22:26 ` Alan Cox
2004-10-03 2:49 ` Paul Jackson
@ 2004-10-03 3:25 ` Paul Jackson
2 siblings, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-03 3:25 UTC (permalink / raw)
To: Hubertus Franke
Cc: akpm, mef, nagar, ckrm-tech, efocht, mbligh, lse-tech, hch,
steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch,
Simon.Derr, ak, sivanich, llp
Hubertus wrote:
> So to me cpumem sets as as concept is useful, so I won't be doing that
> whopping, but ...
I couldn't parse the above ... could you rephrase?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-03 2:49 ` Paul Jackson
@ 2004-10-03 12:19 ` Hubertus Franke
0 siblings, 0 replies; 53+ messages in thread
From: Hubertus Franke @ 2004-10-03 12:19 UTC (permalink / raw)
To: Paul Jackson
Cc: akpm, mef, nagar, ckrm-tech, efocht, mbligh, lse-tech, hch,
steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel, colpatch,
Simon.Derr, ak, sivanich, llp
Paul Jackson wrote:
> Hubertus wrote:
>
>>CKRM could do so. We already provide the name space and the class
>>hierarchy.
>
>
> Just because two things have name spaces and hierarchies, doesn't
> make them interchangeable. Name spaces and hierarchies are just
> implementation mechanisms - many interesting, entirely unrelated,
> solutions make use of them.
>
> What are the objects named, and what is the relation underlying
> the hierarchy? These must match up.
Object name relationships are established through the rcfs pathname.
>
> The objects named in cpusets are subsets of a system's CPUs and Memory
> Nodes. The relation underlying the hierarchy is the subset relation on
> these sets: if one cpuset node is a descendant of another, then its
> CPUs and Memory Nodes are a subset of the other's.
Exactly, the controller will enforce that in the same way we
enforce other attributes and shares.
For example, we make sure that the sum of the share "guarantees" for
all children does not exceed the total_guarantee (i.e. denominator)
of the parent.
Nothing prohibits the controller from enforcing the set constraints
you describe above and rejecting requests that are not valid.
As I said before, ideally the controller would be the cpumem set
guts and RCFS would be the API to it.
That's what Andrew was asking for, should the case for this
functionality be made.
>
> What is the corresponding statement for CKRM?
>
> For CKRM to subsume cpusets, there must be an injective map from the
> above cpuset objects to CKRM objects, that preserves this subset
> relation on cpusets.
>
See above.
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-03 2:26 ` Paul Jackson
@ 2004-10-03 14:11 ` Paul Jackson
0 siblings, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-03 14:11 UTC (permalink / raw)
To: Paul Jackson
Cc: akpm, frankeh, mef, nagar, ckrm-tech, efocht, mbligh, lse-tech,
hch, steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel,
colpatch, Simon.Derr, ak, sivanich, llp
Paul wrote:
> It's a requirement, I say. It's a requirement. Let the slapping begin ;).
Granted, to give Andrew his due (begrudgingly ;), the requirement
to pin processes on CPUs is a requirement of the _implementation_,
which follows, for someone familiar with the art, from the two
items:
1) The requirement of the _user_ that runtimes be repeatable
within perhaps 1% to 5% for a certain class of job, plus
2) The cantankerous properties of big honkin NUMA boxes.
Clearly, Andrew was looking for _user_ requirements, to which I
managed somewhat unwittingly to back up in my use case scenario.
I suspect that there is a second user case scenario, with which the Bull
or NEC folks might be more familiar than I, that can seemingly lead
to the same implementation requirement to pin jobs. This scenario would
involve a customer who has paid good money for some compute capacity
(CPU cycles and Memory pages) with a certain guaranteed Quality of
Service, and who would prefer to see this capacity go to waste when
underutilized rather than risk it being unavailable in times of need.
However in this case, as Andrew is likely already chomping at the bit to
tell me, CKRM could provide such guaranteed compute capacities without
pinning.
Whether or not a CKRM class would sell to the customers of Bull and
NEC in lieu of a set of pinned nodes, I have no clue.
Erich, Simon - Can you introduce a note of reality into my
speculations above?
The third use case scenario that commonly leads us to pinning is
support of the batch or workload managers, PBS and LSF, which are fond
of dividing the compute resources up into identifiable subsets of CPUs
and Memory Nodes that are near to each other (in terms of the NUMA
topology) and that have the size (compute capacity as measured in free
cycles and freely available ram) requested by a job, then attaching that
job to that subset and running it.
In this third case, batch or workload managers have a long history with
big honkin SMP and NUMA boxes, and this remains an important market for
them. Consistent runtimes are valued by their customers and are a key
selling point of these products in the HPC market. So this third case
reduces to the first, with its implementation requirement for pinning
the tasks of an active job to specific CPUs and Memory Nodes.
For example from Platform's web site (the vendor of LSF) at:
http://www.platform.com/products/HPC
the benefits for their LSF HPC product include:
* Guaranteed consistent and reliable parallel workload processing with
high performance interconnect support
* Maximized application performance with topology-aware scheduling
* Ensures application runtime consistency by automatically allocating
similar processors
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-04 15:03 ` Martin J. Bligh
@ 2004-10-04 15:53 ` Paul Jackson
2004-10-04 18:17 ` Martin J. Bligh
2004-10-05 9:26 ` Simon Derr
1 sibling, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2004-10-04 15:53 UTC (permalink / raw)
To: Martin J. Bligh
Cc: pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht, lse-tech,
hch, steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel,
colpatch, Simon.Derr, ak, sivanich
Martin writes:
> OK, then your "exclusive" cpusets aren't really exclusive at all, since
> they have other stuff running in them.
What's clear is that 'exclusive' is not a sufficient precondition for
whatever it is that CKRM needs to have sufficient control.
Instead of trying to wrestle 'exclusive' into doing what you want, do me
a favor, if you would. Help me figure out what conditions CKRM _does_
need to operate within a cpuset, and we'll invent a new property that
satisfies those conditions.
See my earlier posts in the last hour for my efforts to figure out what
these conditions might be. I conjecture that it's something along the
lines of:
Assuring each CKRM instance that it has control of some
subset of a system that's separate and non-overlapping,
with all Memory, CPU, Tasks, and Allowed masks of said
Tasks either wholly owned by that CKRM instance, or
entirely outside.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-04 15:53 ` [ckrm-tech] " Paul Jackson
@ 2004-10-04 18:17 ` Martin J. Bligh
2004-10-04 20:25 ` Paul Jackson
0 siblings, 1 reply; 53+ messages in thread
From: Martin J. Bligh @ 2004-10-04 18:17 UTC (permalink / raw)
To: Paul Jackson
Cc: pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht, lse-tech,
hch, steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel,
colpatch, Simon.Derr, ak, sivanich
--On Monday, October 04, 2004 08:53:27 -0700 Paul Jackson <pj@sgi.com> wrote:
> Martin writes:
>> OK, then your "exclusive" cpusets aren't really exclusive at all, since
>> they have other stuff running in them.
>
> What's clear is that 'exclusive' is not a sufficient precondition for
> whatever it is that CKRM needs to have sufficient control.
>
> Instead of trying to wrestle 'exclusive' into doing what you want, do me
> a favor, if you would. Help me figure out what conditions CKRM _does_
> need to operate within a cpuset, and we'll invent a new property that
> satisfies those conditions.
Oh, I'm not even there yet ... just thinking about what cpusets needs
independently to operate efficiently - I don't think cpus_allowed is efficient.
Whatever we call it, the resource management system definitely needs the
ability to isolate a set of resources (CPUs, RAM) totally dedicated to
one class or group of processes. That's what I see as the main feature
of cpusets right now, though there may be other things there as well that
I've missed? At least that's the main feature I personally see a need for ;-)
> See my earlier posts in the last hour for my efforts to figure out what
> these conditions might be. I conjecture that it's something along the
> lines of:
>
> Assuring each CKRM instance that it has control of some
> subset of a system that's separate and non-overlapping,
> with all Memory, CPU, Tasks, and Allowed masks of said
> Tasks either wholly owned by that CKRM instance, or
> entirely outside.
Mmm. Looks like you're trying to do multiple CKRMs, one inside each cpuset,
right? Not sure that's the way I'd go, but maybe it makes sense.
The way I'm looking at it, which is probably wholly insufficient, if not
downright wrong, we have multiple process groups, each of which gets some
set of resources. Those resources may be dedicated to that class (a la
cpusets) or not. One could view this as a set of resource groupings, and
set of process groupings, where one or more process groupings is bound to
a resource grouping.
The resources are cpus & memory, mainly, in my mind (though I guess IO,
etc fit too). The resource sets are more like cpusets, and the process
groups a bit more like CKRM, except they seem to overlap (to me) when
the sets in cpusets are non-exclusive, or when CKRM wants harder performance
guarantees.
Feel free to point out where I'm full of shit / missing the point ;-)
M.
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-04 18:17 ` Martin J. Bligh
@ 2004-10-04 20:25 ` Paul Jackson
2004-10-04 22:15 ` Martin J. Bligh
0 siblings, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2004-10-04 20:25 UTC (permalink / raw)
To: Martin J. Bligh
Cc: pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht, lse-tech,
hch, steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel,
colpatch, Simon.Derr, ak, sivanich
Martin wrote:
> Mmm. Looks like you're trying to do multiple CKRMs, one inside each cpuset,
> right? Not sure that's the way I'd go, but maybe it makes sense.
No - I was just reflecting my lack of adequate understanding of CKRM.
You guys were trying to get certain semantics out of cpusets to meet
your needs, putting words in my mouth as to what things like "exclusive"
meant, and I was pushing back, trying to get a fair, implementation
neutral statement of just what it was that CKRM needed out of cpusets,
by in part phrasing things in terms of what I thought you were trying to
have CKRM do with cpusets. Turns out I speak CKRM substantially worse
than you guys speak cpusets. <grin>
So nevermind what I was trying to do, which was, as you guessed:
>
> Looks like you're trying to do multiple CKRMs, one inside each cpuset,
Let me try again to see if I can figure out what you're trying to do.
You write:
>
> The way I'm looking at it, which is probably wholly insufficient, if not
> downright wrong, we have multiple process groups, each of which gets some
> set of resources. Those resources may be dedicated to that class (a la
> cpusets) or not. One could view this as a set of resource groupings, and
> set of process groupings, where one or more process groupings is bound to
> a resource grouping.
>
> The resources are cpus & memory, mainly, in my mind (though I guess IO,
> etc fit too). The resource sets are more like cpusets, and the process
> groups a bit more like CKRM, except they seem to overlap (to me) when
> the sets in cpusets are non-exclusive, or when CKRM wants harder performance
> guarantees.
I can understand it far enough to see groups of processes using groups
of resources (cpus & memory, like cpusets). Both of the phrases
containing "CKRM" in them go right past ... whizz. And I'm a little
fuzzy on what are the sets, invariants, relations, domains, ranges,
operations, pre and post conditions and such that could be modeled in a
more precise manner.
Keep talking ... Perhaps an example, along the lines of my "use case
scenarios", would help. When we start losing each other trying to
generalize too fast, it can help to make up an overly concrete example,
to get things grounded again.
> Whatever we call it, the resource management system definitely needs the
> ability to isolate a set of resources (CPUs, RAM) totally dedicated to
> one class or group of processes.
Not always "totally isolated and dedicated".
Here's a scenario that shows up some uses for "non-exclusive" cpusets.
Let's take my big 256 CPU system, divided into portions of 128, 64 and
64. At this level, these are three, mutually exclusive cpusets, and
interaction between them is minimized. In the first two portions, the
128 and the first 64, a couple of "company jewel" applications run.
These are highly tuned, highly parallel applications that are sucking up
99% of every CPU cycle, bus cycle, cache line and memory page available,
for hours on end, in a closely synchronized dance. They cannot tolerate
anything else interfering in their area. Frankly, they have little use
for CKRM, fancy schedulers or sophisticated allocators. They know
what's there, it's all theirs, and they know exactly what they want to
do with it. Get out of the way and let them do their job. Industrial
strength computing at its finest.
Ok that much is as before.
Now the last portion, the second 64, is more of a general use area. It
is less fully utilized, and its job mix is more varied and less tightly
administered. There's some 64-thread background application that puts a
fairly light load on things, running day and night (maybe the V.P. of
the MIS shop is a fan of SETI).
Since this is a parallel programming shop, people show up at random
hours with smaller parallel jobs, carve off temporary cpusets of the
appropriate size, and run an application in them. Their threads and
memory within their temporary cpuset are carefully placed, relative to
their cpuset, but they are not fully utilizing the nodes on which they
are running and they tolerate other things happening on the same nodes.
Perhaps the other stuff doesn't impact their performance much, or
perhaps they are too poor to pay for dedicated nodes (grad students
still looking for a grant?) ... whatever.
They may well make good use of a batch manager, to which they submit
jobs of a specified size (cpus and memory) so that the batch manager can
smooth out the load and avoid periods of excess idling or thrashing.
The implementation of the batch manager relies heavily on the underlying
cpuset facility to manage various subsets of CPU and Memory Nodes. The
batch manager might own the first 192 CPUs on the system too, but most
users never get to see that part of the system.
Within that last 64 portion the current mechanisms, including the per
task cpus_allowed and mems_allowed, and the current schedulers and
allocators, may well be doing a pretty good job. Sure, there is an
element of chaos and things aren't perfect. It's the "usual" timeshare
environment with a varied load mix.
The enforced placement within the smaller nested non-exclusive cpusets
probably surprises the scheduler and allocator at times, leading to
unfair imbalances. I imagine that if CKRM just had that last 64 portion
to manage, and this was just a 64 CPU system, not a 256, then CKRM could
do a pretty good job of managing the system's resources.
Enough of this story ...
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-04 20:25 ` Paul Jackson
@ 2004-10-04 22:15 ` Martin J. Bligh
2004-10-05 9:17 ` Paul Jackson
0 siblings, 1 reply; 53+ messages in thread
From: Martin J. Bligh @ 2004-10-04 22:15 UTC (permalink / raw)
To: Paul Jackson
Cc: pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht, lse-tech,
hch, steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel,
colpatch, Simon.Derr, ak, sivanich
>> The way I'm looking at it, which is probably wholly insufficient, if not
>> downright wrong, we have multiple process groups, each of which gets some
>> set of resources. Those resources may be dedicated to that class (a la
>> cpusets) or not. One could view this as a set of resource groupings, and
>> set of process groupings, where one or more process groupings is bound to
>> a resource grouping.
>>
>> The resources are cpus & memory, mainly, in my mind (though I guess IO,
>> etc fit too). The resource sets are more like cpusets, and the process
>> groups a bit more like CKRM, except they seem to overlap (to me) when
>> the sets in cpusets are non-exclusive, or when CKRM wants harder performance
>> guarantees.
>
> I can understand it far enough to see groups of processes using groups
> of resources (cpus & memory, like cpusets). Both of the phrases
> containing "CKRM" in them go right past ... whizz. And I'm a little
> fuzzy on what are the sets, invariants, relations, domains, ranges,
> operations, pre and post conditions and such that could be modeled in a
> more precise manner.
>
> Keep talking ... Perhaps an example, along the lines of my "use case
> scenarios", would help. When we start losing each other trying to
> generalize too fast, it can help to make up an overly concrete example,
> to get things grounded again.
Let me make one thing clear: I don't work on CKRM ;-) So I'm not either
desperately familiar with it, or partial to it. Nor am I desperately
infatuated enough with my employer to believe that just because they're
involved with it, it must be stunningly brilliant. So I think I'm actually
fairly impartial ... and balanced in ignorance on both sides ;-)
I do think both things are solving perfectly valid problems (that IMO
intersect) ... not sure whether either is doing it the best way though ;-).
>> Whatever we call it, the resource management system definitely needs the
>> ability to isolate a set of resources (CPUs, RAM) totally dedicated to
>> one class or group of processes.
>
> Not always "totally isolated and dedicated".
>
> Here's a scenario that shows up some uses for "non-exclusive" cpusets.
>
> Let's take my big 256 CPU system, divided into portions of 128, 64 and
> 64. At this level, these are three, mutually exclusive cpusets, and
> interaction between them is minimized. In the first two portions, the
> 128 and the first 64, a couple of "company jewel" applications run.
> These are highly tuned, highly parallel applications that are sucking up
> 99% of every CPU cycle, bus cycle, cache line and memory page available,
> for hours on end, in a closely synchronized dance. They cannot tolerate
> anything else interfering in their area. Frankly, they have little use
> for CKRM, fancy schedulers or sophisticated allocators. They know
> what's there, it's all theirs, and they know exactly what they want to
> do with it. Get out of the way and let them do their job. Industrial
> strength computing at its finest.
>
> Ok that much is as before.
>
> Now the last portion, the second 64, is more of a general use area. It
> is less fully utilized, and its job mix more varied and less tightly
> administered. There's some 64-thread background application that puts a
> fairly light load on things, running day and night (maybe the V.P. of
> the MIS shop is a fan of SETI).
>
> Since this is a parallel programming shop, people show up at random
> hours with smaller parallel jobs, carve off temporary cpusets of the
> appropriate size, and run an application in them. Their threads and
> memory within their temporary cpuset are carefully placed, relative to
> their cpuset, but they are not fully utilizing the nodes on which they
> are running and they tolerate other things happening on the same nodes.
> Perhaps the other stuff doesn't impact their performance much, or
> perhaps they are too poor to pay for dedicated nodes (grad students
> still looking for a grant?) ... whatever.
OK, the dedicated stuff in cpusets makes a lot of sense to me, for the
reasons you describe above. One screaming problem we have at the moment
is we can easily say "I want to bind myself to CPU X" but no way to say
"kick everyone else off it". That seems like a very real problem.
However, the non-dedicated stuff seems much more debatable, and where
the overlap with CKRM stuff seems possible to me. Do the people showing
up at random with smaller parallel jobs REALLY, REALLY care about the
physical layout of the machine? I suspect not, it's not the highly tuned
syncopated rhythm stuff you describe above. The "give me 1.5 CPUs worth
of bandwidth please" model of CKRM makes much more sense to me.
> They may well make good use of a batch manager, to which they submit
> jobs of a specified size (cpus and memory) so that the batch manager can
> smooth out the load and avoid periods of excess idling or thrashing.
> The implementation of the batch manager relies heavily on the underlying
> cpuset facility to manage various subsets of CPU and Memory Nodes. The
> batch manager might own the first 192 CPUs on the system too, but most
> users never get to see that part of the system.
>
> Within that last 64 portion the current mechanisms, including the per
> task cpus_allowed and mems_allowed, and the current schedulers and
> allocators, may well be doing a pretty good job. Sure, there is an
> element of chaos and things aren't perfect. It's the "usual" timeshare
> environment with a varied load mix.
>
> The enforced placement within the smaller nested non-exclusive cpusets
> probably surprises the scheduler and allocator at times, leading to
> unfair imbalances. I imagine that if CKRM just had that last 64 portion
> to manage, and this was just a 64 CPU system, not a 256, then CKRM could
> do a pretty good job of managing the system's resources.
Right - exactly. Sounds like we're actually pretty much on the same page
(by the time I'd finished your email ;-)). So whatever the interface we
have, the underlying mechanisms seem to have two fundamentals: dedicated
and non-dedicated resources. cpusets seems to do a good job of dedicated
and I'd argue the interface of specifying physical resources is a bit
clunky for non-dedicated stuff. CKRM doesn't seem to tackle the dedicated
at all, but seems to have an easier way of doing the non-dedicated.
So personally what I'd like is to have a unified interface (and I care
not a hoot which, or a new one altogether), that can specify dedicated
or non-decicated resources for groups of processes, and then have a
"cpusets-style" mechanism for the dedicated, and "CKRM-style" mechanism
for the non-dedicated. Not sure if that's exactly what Andrew was hoping
for, or the rest of you either ;-)
The whole discussion about multiple sched-domains, etc, we had earlier
is kind of just an implementation thing, but is a crapload easier to do
something efficient here if the bits caring about that stuff are only
dealing with dedicated resource partitions.
OK, now my email is getting as long as yours, so I'll stop ;-) ;-)
M.
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-04 11:44 ` Rick Lindsley
@ 2004-10-04 22:46 ` Paul Jackson
0 siblings, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-04 22:46 UTC (permalink / raw)
To: Rick Lindsley
Cc: mbligh, pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht,
lse-tech, hch, steiner, jbarnes, sylvain.jeaugey, djh,
linux-kernel, colpatch, Simon.Derr, ak, sivanich
Good questions - thanks.
Rick wrote:
> So the examples you gave before were rather oversimplified, then?
Yes - they were. Quite intentionally.
> some portion of that must be reserved for the "bootcpuset". Would this
> be enforced by the kernel, or the administrator?
It's administrative. You don't have to run your system this way. The
kernel threads (both per-cpu and system-wide), as well as init and the
classic Unix daemons, can be left running in the root cpuset (see below
for what that is). The kernel doesn't care.
It was the additional request for a CKRM friendly setup that led me to
point out that system-wide kernel threads could be confined to a
"bootcpuset". Since bootcpuset is user level stuff, I hadn't mentioned
it before, on the kernel mailing list.
The more common reason for confining such kthreads and Unix daemons to a
bootcpuset is to minimize interactions between such tasks and important
applications.
> I might suggest a simpler approach. As a matter of policy, at least one
> cpu must remain outside of cpusets so that system processes like init,
> getty, lpd, etc. have a place to run.
This is the same thing, in different words. In my current cpuset
implementation, _every_ task is attached to a cpuset.
What you call a cpu that "remains outside of cpusets" is the bootcpuset,
in my terms.
> The tasks whose cpus_allowed is a strict _subset_ of cpus_online_map
> need to be where they are. These are things like the migration
> helper threads, one for each cpu. They get a license to violate
> cpuset boundaries.
>
> Literally, or figuratively? (How do we recognize these tasks?)
I stated one critical word too vaguely. Let me restate (s/tasks/kernel
threads/), then translate.
> The kernel threads whose cpus_allowed is a strict _subset_ of cpus_online_map
> need to be where they are. These are things like the migration
> helper threads, one for each cpu. They get a license to violate
> cpuset boundaries.
> Literally, or figuratively? (How do we recognize these tasks?)
Literally. The early (_very_ early) user level code that sets up the
bootcpuset, as requested by a configuration file in /etc, moves the
kthreads with a cpus_allowed >= what's online to the bootcpuset, but
leaves the kthreads with a cpus_allowed < online where they are, in the
root cpuset.
If you do a "ps -efl", look for the tasks early in the list whose
command names end in something like "/2" (printf format "/%u"). These
are the kthreads that usually need to be pinned on a CPU.
But you don't need to do that - an early boot user utility does it
as part of setting up the bootcpuset.
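The classification rule described above can be sketched as follows. This is an illustrative model, not the actual early-boot utility: the task names and CPU masks are made-up example data, and a real implementation would read each kthread's cpus_allowed from the kernel.

```python
# Sketch of the bootcpuset setup rule: kernel threads allowed on every
# online CPU are movable to the bootcpuset, while kthreads pinned to a
# strict subset (e.g. the per-cpu migration helpers) stay in the root
# cpuset. Task list and masks are hypothetical illustration data.

def classify_kthreads(kthreads, online_cpus):
    """Split kthreads into (movable_to_bootcpuset, keep_in_root)."""
    movable, pinned = [], []
    for name, cpus_allowed in kthreads:
        if online_cpus <= cpus_allowed:   # allowed everywhere -> movable
            movable.append(name)
        else:                             # strict subset -> stays pinned
            pinned.append(name)
    return movable, pinned

online = {0, 1, 2, 3}
tasks = [
    ("kswapd0",     {0, 1, 2, 3}),  # system-wide kthread
    ("migration/0", {0}),           # per-cpu helper, must stay put
    ("migration/1", {1}),
    ("events/2",    {2}),
]
movable, pinned = classify_kthreads(tasks, online)
```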
> Will cpus in exclusive cpusets be asked to service interrupts?
The current cpuset implementation makes no effort to manage interrupts.
To manage interrupts in relation to cpusets today, you'd have to use
some other means to control or determine where interrupts were going,
and then place your cpusets with that in mind.
> So with my bootcpuset, the problem is reduced, to a few tasks
> per CPU, such as the migration threads, which must remain pinned
> on their one CPU (or perhaps on just the CPUs local to one Memory
> Node). These tasks remain in the root cpuset, which by the scheme
> we're contemplating, doesn't get a sched_domain in the fancier
> configurations.
>
> You just confused me on many different levels:
>
> * what is the root cpuset? Is this the same as the "bootcpuset" you
> made mention of?
Not the same.
The root cpuset is the all encompassing cpuset representing the entire
system, from which all other cpusets are formed. The root cpuset always
contains all CPUs and all Memory Nodes.
The bootcpuset is typically a small cpuset, a direct child of the root
cpuset, containing what would be in your terms the one or a few cpus
that are reserved for the classic Unix system processes like init,
getty, lpd, etc.
> * so where *do* these tasks go in the "fancier configurations"?
Er eh - in the root cpuset ;). Hmmm ... guess that's not your question.
In this fancy configuration, I had the few kthreads that could _not_
be moved to the bootcpuset, because they had to remain pinned on
specific CPUs (e.g. the migration threads), remain in the root cpuset.
When the exclusive child cpusets were formed, and each given their own
special scheduler domain, I rebound the scheduler domain to use for
these per-cpu kthreads to whichever scheduler domain managed the cpu
that thread lived on. The thread remained in the root cpuset, but
hitched a ride on the scheduler that had assumed control of the cpu that
the thread lived on. Everything in this paragraph is something I
invented in the last two days, in response to various requests from
others for setups that provided a clear boundary of control to
schedulers.
> If we just wrote the code, and quit trying to find a grand unifying
> theory to explain it consistently with the rest of our design,
> it would probably work just fine.
>
> I'll assume we're missing a smiley here.
Not really. The per-cpu kthreads are a wart that doesn't fit the
particular design being discussed here very well. Warts happen.
> When you "remove a cpuset" you just or in the right bits in everybody's
> cpus_allowed fields and they start migrating over.
>
> To me, this all works for the cpu-intensive, gotta have it with 1% runtime
> variation example you gave. Doesn't it? And it seems to work for the
> department-needs-8-cpus-to-do-as-they-please example too, doesn't it?
What you're saying is rather like saying I don't need a file system
on my floppy disk. Well, originally, I didn't. I wrote the bytes
to my tape cassette, I read them back. What's the problem? If I
wanted to name the bytes, I stuck a label on the cassette and wrote
a note on the label.
Yes, that works. As systems get bigger, and as we add batch managers
and such to handle a more complicated set of jobs, we need to be able
to do things like:
* name sets of CPUs/Memory, in a way consistent across the system
* create and destroy a set
* control who can query, modify and attach a set
* change which set a task is attached to
* list which tasks are currently attached to a set
* query, set and change which CPUs and Memory are in a set.
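The operations listed above map naturally onto a per-cpuset directory of small files, which is the shape of the /dev/cpuset interface ("cpus", "mems" and "tasks" files per set, as in the cpuset documentation). Here is a self-contained toy model of those operations; a temporary directory stands in for the real mounted cpuset filesystem, so nothing here touches the kernel.

```python
# Toy model of the named-set operations above, mimicking the /dev/cpuset
# layout: one directory per cpuset holding "cpus", "mems" and "tasks".
# A tempdir substitutes for the real cpuset filesystem mount point.
import os
import tempfile

root = tempfile.mkdtemp()

def create_set(name, cpus, mems):
    path = os.path.join(root, name)
    os.mkdir(path)                              # create and name the set
    with open(os.path.join(path, "cpus"), "w") as f:
        f.write(cpus)                           # which CPUs are in the set
    with open(os.path.join(path, "mems"), "w") as f:
        f.write(mems)                           # which Memory Nodes
    open(os.path.join(path, "tasks"), "w").close()

def attach_task(name, pid):
    with open(os.path.join(root, name, "tasks"), "a") as f:
        f.write("%d\n" % pid)                   # change which set a task is in

def list_tasks(name):
    with open(os.path.join(root, name, "tasks")) as f:
        return [int(line) for line in f]        # who is attached to a set

create_set("batch", "4-7", "1")
attach_task("batch", 1234)
```

Querying, permission control and destruction fall out of ordinary directory permissions and rmdir in the same way.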
This is like needing a FAT file system for your floppy. Cpusets
join the collection of "first class, kernel managed" objects,
and are no longer just the implied attributes of each task.
Batch managers and sysadmins of more complex, dynamically changing
configurations, sometimes on very large systems that are shared across
several departments or divisions, depend on this ability to treat
cpusets as first class, kernel managed objects.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-02 23:44 ` Hubertus Franke
@ 2004-10-05 3:13 ` Matthew Helsley
2004-10-05 8:30 ` Hubertus Franke
0 siblings, 1 reply; 53+ messages in thread
From: Matthew Helsley @ 2004-10-05 3:13 UTC (permalink / raw)
To: Hubertus Franke
Cc: Peter Williams, dipankar, Paul Jackson, Andrew Morton, CKRM-Tech,
efocht, Martin Bligh, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, Matthew Dobson, Simon.Derr,
ak, sivanich
On Sat, 2004-10-02 at 16:44, Hubertus Franke wrote:
<snip>
> along cpuset boundaries. If taskclasses are allowed to span disjoint
> cpumemsets, what is then the definition of setting shares ?
<snip>
I think the clearest interpretation is the share ratios are the same
but the quantity of "real" resources and the sum of shares allocated is
different depending on cpuset.
For example, suppose we have taskclass/A that spans cpusets Foo and Bar
-- processes foo and bar are members of taskclass/A but in cpusets Foo
and Bar respectively. Both get up to 50% share of cpu time in their
respective cpusets because they are in taskclass/A. Further suppose that
cpuset Foo has 1 CPU and cpuset Bar has 2 CPUs.
This means process foo could consume up to half a CPU while process bar
could consume up to a whole CPU. In order to enforce cpuset
partitioning, each class would then have to track its share usage on a
per-cpuset basis. [Otherwise share allocation in one partition could
prevent share allocation in another partition. Using the example above,
suppose process foo is using 45% of CPU in cpuset Foo. If the total
share consumption is calculated across cpusets process bar would only be
able to consume up to 5% of CPU in cpuset Bar.]
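The interpretation above reduces to a simple scaling rule: the share ratio is fixed per class, but the real CPU time it buys scales with the size of the cpuset a member runs in. A numeric sketch, using the hypothetical foo/bar numbers from the example:

```python
# Sketch of per-cpuset share interpretation: the same 50% class share
# maps to different absolute CPU time depending on cpuset size.

def cpu_cap(share_ratio, cpuset_cpus):
    """Maximum CPU time (in CPUs) a class member may use in one cpuset."""
    return share_ratio * cpuset_cpus

foo_cap = cpu_cap(0.5, 1)  # process foo: 50% share of 1-CPU cpuset Foo
bar_cap = cpu_cap(0.5, 2)  # process bar: 50% share of 2-CPU cpuset Bar
```

This is why usage must be tracked per cpuset: foo's consumption against its 0.5-CPU cap says nothing about how much of bar's 1-CPU cap remains.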
Cheers,
-Matt Helsley
* RE: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
@ 2004-10-05 6:05 Stan Hoeppner
0 siblings, 0 replies; 53+ messages in thread
From: Stan Hoeppner @ 2004-10-05 6:05 UTC (permalink / raw)
To: 'Paul Jackson', Martin J. Bligh
Cc: pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht, lse-tech,
hch, steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel,
colpatch, Simon.Derr, ak, sivanich
Rick Lindsley wrote:
Will cpus in exclusive cpusets be asked to service interrupts?
Paul Jackson wrote:
Let's take my big 256 CPU system, divided into portions of 128, 64 and
64. At this level, these are three, mutually exclusive cpusets, and
interaction between them is minimized. In the first two portions, the
128 and the first 64, a couple of "company jewel" applications run.
These are highly tuned, highly parallel applications that are sucking up
99% of every CPU cycle, bus cycle, cache line and memory page available,
for hours on end, in a closely synchronized dance. They cannot tolerate
anything else interfering in their area. Frankly, they have little use
for CKRM, fancy schedulers or sophisticated allocators. They know
what's there, it's all their's, and they know exactly what they want to
do with it. Get out of the way and let them do their job. Industrial
strength computing at its finest.
In Paul's example 256 Altix system here, there are 3 cpusets. Assuming each
cpuset will consist of:
128 / 4 = 32 C-bricks
64 / 4 = 16 C-bricks
64 / 4 = 16 C-Bricks
Which C-bricks have P-bricks (PCI-X I/O) attached to them, and to which
P-bricks are the Gig-E cards and PCI-X fiber channel cards with TP9500 disk
arrays attached? And thus, where is the network and disk I/O interrupt load
concentrated? I would assume that all the network I/O will be on a single
"Interactive" C-brick (4 CPUs = 2 NUMA nodes). In which cpuset are the
fiber channel cards/disk arrays concentrated? Or are they physically spread
out evenly across the entire system for balance? Does the system even use
"local" storage, or will there be a few C-bricks/nodes with P-bricks and
many fiber channel cards that talk to a CXFS SAN host?
If all the I/O is concentrated in one of these 3 cpusets, then it would make
sense to "lock" I/O interrupts to CPUs within that cpuset. Additionally, if
a user is running an application in one of the *other* cpusets and he/she
needs "guaranteed rate I/O", I can see where something like the CKRM
framework would be needed within the "I/O heavy" cpuset. So maybe CKRM
within a cpuset could assist in getting something like the IRIX XFS GRIO
feature into Altix?
Stan Hoeppner
TheHardwareFreak
stan@hardwarefreak.com
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-05 3:13 ` [ckrm-tech] " Matthew Helsley
@ 2004-10-05 8:30 ` Hubertus Franke
2004-10-05 14:20 ` Paul Jackson
0 siblings, 1 reply; 53+ messages in thread
From: Hubertus Franke @ 2004-10-05 8:30 UTC (permalink / raw)
To: Matthew Helsley
Cc: Peter Williams, dipankar, Paul Jackson, Andrew Morton, CKRM-Tech,
efocht, Martin Bligh, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, Matthew Dobson, Simon.Derr,
ak, sivanich
Matthew Helsley wrote:
> On Sat, 2004-10-02 at 16:44, Hubertus Franke wrote:
> <snip>
>
>>along cpuset boundaries. If taskclasses are allowed to span disjoint
>>cpumemsets, what is then the definition of setting shares ?
>
> <snip>
>
> I think the clearest interpretation is the share ratios are the same
> but the quantity of "real" resources and the sum of shares allocated is
> different depending on cpuset.
>
> For example, suppose we have taskclass/A that spans cpusets Foo and Bar
> -- processes foo and bar are members of taskclass/A but in cpusets Foo
> and Bar respectively. Both get up to 50% share of cpu time in their
> respective cpusets because they are in taskclass/A. Further suppose that
> cpuset Foo has 1 CPU and cpuset Bar has 2 CPUs.
Yes, we (Shailabh and I) were talking about exactly that this
afternoon. This would mean that the denominator of the cpu shares for a
given class <cls> is not determined solely by the parent's
total_guarantee but by:
total_guarantee * size(cls->parent->cpuset) / size(cls->cpuset)
This is effectively what you describe below.
>
> This means process foo could consume up to half a CPU while process bar
> could consume up to a whole CPU. In order to enforce cpuset
> partitioning, each class would then have to track its share usage on a
> per-cpuset basis. [Otherwise share allocation in one partition could
> prevent share allocation in another partition. Using the example above,
> suppose process foo is using 45% of CPU in cpuset Foo. If the total
> share consumption is calculated across cpusets process bar would only be
> able to consume up to 5% of CPU in cpuset Bar.]
>
This would require some changes in the CPU scheduler to teach the
cpu-monitor to deal with the limited scope. It would also require some
mods to the API:
Since classes can span different cpu sets with different shares
how do we address the cpushare of a class in the particular context
of a cpu-set.
Alternatively, one could require that classes can not span different
cpu-sets, which would significantly reduce the complexity of this.
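Hubertus's scaled denominator can be written out directly. This is a sketch of the formula as stated in the message, with made-up example sizes; it is not CKRM code:

```python
# Scaled share denominator from the formula above:
#   total_guarantee * size(cls->parent->cpuset) / size(cls->cpuset)
# so a class confined to a smaller cpuset weighs its shares against a
# proportionally larger denominator.

def scaled_denominator(total_guarantee, parent_cpuset_size, class_cpuset_size):
    return total_guarantee * parent_cpuset_size / class_cpuset_size

# Hypothetical example: a parent guarantee of 100 shares, a 3-CPU parent
# cpuset, and a class confined to a 1-CPU cpuset.
denom = scaled_denominator(100, 3, 1)
```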
> Cheers,
> -Matt Helsley
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-04 22:15 ` Martin J. Bligh
@ 2004-10-05 9:17 ` Paul Jackson
2004-10-05 10:01 ` Paul Jackson
2004-10-05 22:24 ` Matthew Dobson
0 siblings, 2 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-05 9:17 UTC (permalink / raw)
To: Martin J. Bligh
Cc: pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht, lse-tech,
hch, steiner, jbarnes, sylvain.jeaugey, djh, linux-kernel,
colpatch, Simon.Derr, ak, sivanich
Martin wrote:
> Let me make one thing clear: I don't work on CKRM ;-)
ok ...
Indeed, unless I'm not recognizing someone's expertise properly, there
seems to be a shortage of the CKRM experts on this thread.
Who am I missing ...
> However, the non-dedicated stuff seems much more debateable, and where
> the overlap with CKRM stuff seems possible to me. Do the people showing
> up at random with smaller parallel jobs REALLY, REALLY care about the
> physical layout of the machine? I suspect not, it's not the highly tuned
> syncopated rhythm stuff you describe above. The "give me 1.5 CPUs worth
> of bandwidth please" model of CKRM makes much more sense to me.
It will vary. In shops that are doing a lot of highly parallel work,
such as with OpenMP or MPI, many smaller parallel jobs will also be
placement sensitive. The performance of such jobs is hugely sensitive
to their placement and scheduling on dedicated CPUs and Memory, one per
active thread.
These shops will often use a batch scheduler or workload manager, such
as PBS or LSF to manage their jobs. PBS and LSF make a business of
defining various sized cpusets to fit the queued jobs, and running each
job in a dedicated cpuset. Their value comes from obtaining high
utilization, and optimum repeatable runtimes, on a varied input job
stream, especially of placement sensitive jobs. The feature set of
cpusets was driven as much as anything by what was required to support a
port of PBS or LSF.
> I'd argue the interface of specifying physical resources is a bit
> clunky for non-dedicated stuff.
Likely so - the interface is expected to be wrapped with a user level
'cpuset' library, which converts it to a 'C' friendly model. And that
in turn is expected to be wrapped with a port of LSF or PBS, which
converts placement back to something that the customer finds familiar
and useful for managing their varied job mix.
I don't expect admins at HPC shops to spend much time poking around the
/dev/cpuset file system, though it is a nice way to look around and
figure out how things work.
The /dev/cpuset pseudo file system api was chosen because it was
convenient for small scale work, learning and experimentation, because
it was a natural for the hierarchical name space with permissions that I
required, and because it was convenient to leverage existing vfs
structure in the kernel.
> So personally what I'd like is to have a unified interface
> ...
> Not sure if that's exactly what Andrew was hoping
> for, or the rest of you either ;-)
Well, not what I'm pushing for, that's for sure.
We really have two different mechanisms here:
1) A placement mechanism, explicitly specifying what CPUs and Memory
Nodes are allowed, and
2) A sharing mechanism, specifying what proportion of fungible
resources, such as cpu cycles, page faults and i/o requests, a
particular subset (class) of the user population is to receive.
If you look at the very lowest level hooks for cpusets and CKRM, you
will see the essential difference:
1) cpusets hooks the scheduler to prohibit scheduling on a CPU that
is not allowed, and the allocator to prohibit obtaining memory
on a Node that is not allowed.
2) CKRM hooks these and other places to throttle tasks by inserting
small delays, so as to obtain the requested share or percentage,
per class of user, of the rate of usage of fungible resources.
The specific details which must be passed back and forth across the
boundary between the kernel and user-space for these two mechanisms are
simply different. One controls which of a list of enumerable finite
non-substitutable resources may or may not be used, and the other
controls what share of other anonymous, fungible resources may be used.
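The essential difference between the two lowest-level hooks can be caricatured in a few lines. Both functions below are illustrative toys, not kernel code: cpusets answers a hard yes/no placement question, while CKRM computes a small corrective delay to steer a class toward its share.

```python
# Placement hook (cpusets-style): scheduling on a disallowed CPU is
# simply prohibited -- a boolean membership test.
def cpuset_allows(task_cpus_allowed, cpu):
    return cpu in task_cpus_allowed

# Sharing hook (CKRM-style): a class that overshoots its target share
# of a fungible resource gets throttled by a small delay, proportional
# to the overshoot. The delay model here is a made-up illustration.
def ckrm_delay(observed_share, target_share, base_delay=1.0):
    if observed_share <= target_share:
        return 0.0
    return base_delay * (observed_share - target_share) / target_share
```

One interface needs to carry lists of named, non-substitutable resources; the other needs to carry target proportions, which is why the data crossing the kernel boundary differs.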
Looking for a unified interface is a false economy in my view, and I
am suspicious that such a search reflects a failure to recognize the
essential differences between the two mechanisms.
> The whole discussion about multiple sched-domains, etc, we had earlier
> is kind of just an implementation thing, but is a crapload easier to do
> something efficient here if the bits caring about that stuff are only
> dealing with dedicated resource partitions.
Yes - much easier. I suspect that someday I will have to add to cpusets
the ability to provide, for select cpusets, the additional guarantees
(sole and exclusive ownership of all the CPUs, Memory Nodes, Tasks and
affinity masks therein) which a scheduler or allocator that's trying to
be smart requires to avoid going crazy. Not all cpusets need this - but
those cpusets which define the scope of scheduler or allocator domain
would sure like it. Whatever my exclusive flag means now, I'm sure we
all agree that it is too weak to meet this particular requirement.
> OK, now my email is getting as long as yours, so I'll stop ;-) ;-)
That would be tragic indeed. Good thing you stopped.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-05 9:17 ` Paul Jackson
@ 2004-10-05 10:01 ` Paul Jackson
2004-10-05 22:24 ` Matthew Dobson
1 sibling, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-05 10:01 UTC (permalink / raw)
To: Paul Jackson
Cc: mbligh, pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht,
lse-tech, hch, steiner, jbarnes, sylvain.jeaugey, djh,
linux-kernel, colpatch, Simon.Derr, ak, sivanich
> Who am I missing ...
Oops - hi, Hubertus ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-05 8:30 ` Hubertus Franke
@ 2004-10-05 14:20 ` Paul Jackson
0 siblings, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-05 14:20 UTC (permalink / raw)
To: Hubertus Franke
Cc: matthltc, pwil3058, dipankar, akpm, ckrm-tech, efocht, mbligh,
lse-tech, hch, steiner, jbarnes, sylvain.jeaugey, djh,
linux-kernel, colpatch, Simon.Derr, ak, sivanich
Hubertus writes:
> Since classes can span different cpu sets with different shares
> how do we address the cpushare of a class in the particular context
> of a cpu-set.
> Alternatively, one could require that classes can not span different
> cpu-sets, which would significantly reduce the complexity of this.
It's not just cpusets that sets a tasks cpus_allowed ...
Lets say we have a 16 thread OpenMP application, running on a cpuset of
16 CPUs on a large system, one thread pinned to each CPU of the 16 using
sched_setaffinity, running exclusively there. Which means that there
are perhaps eight tasks pinned on each of those 16 CPUs, the one OpenMP
thread, and perhaps seven indigenous per-cpu kernel threads:
migration, ksoftirq, events, kblockd, aio, xfslogd and xfsdatad
(using what happens to be on a random 2.6 Altix in front of me).
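The per-thread pinning described above goes through sched_setaffinity. On Linux, Python's os module exposes the same syscall, so a minimal sketch of a worker binding itself to one allowed CPU (and restoring its mask afterwards) looks like this:

```python
# Sketch of per-thread CPU pinning via sched_setaffinity, as an OpenMP
# runtime would do for each worker. The process pins itself to a single
# CPU from its currently allowed set, then restores the original mask.
import os

original = os.sched_getaffinity(0)     # CPUs this task may run on now
one_cpu = min(original)
os.sched_setaffinity(0, {one_cpu})     # pin, as a worker thread would
pinned = os.sched_getaffinity(0)
os.sched_setaffinity(0, original)      # restore the original mask
```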
Then the class(es) containing the eight tasks on any given one of these
CPUs would be required to not contain any other tasks outside of those
eight, by your reduced complexity alternative, right?
On whom/what would this requirement be imposed? Hopefully some CKRM
classification would figure this out and handle the classification
automatically.
What of the couple of "mother" tasks in this OpenMP application, which
are in this same 16 CPU cpuset, probably pinned to all 16 of the CPUs,
instead of to any individual one of them? What are the requirements on
the classes to which these tasks belong, in relation to the above
classes for the per-cpu kthreads and per-cpu OpenMP threads? And on
what person/software is the job of adapting to these requirements
imposed?
Observe by the way that so long as:
1) the per-cpu OpenMP threads each get to use 99+% of their
respective CPUs,
2) CKRM didn't impose any constraints or work on anything else
then what CKRM does here doesn't matter.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-05 9:17 ` Paul Jackson
2004-10-05 10:01 ` Paul Jackson
@ 2004-10-05 22:24 ` Matthew Dobson
1 sibling, 0 replies; 53+ messages in thread
From: Matthew Dobson @ 2004-10-05 22:24 UTC (permalink / raw)
To: Paul Jackson
Cc: Martin J. Bligh, pwil3058, frankeh, dipankar, Andrew Morton,
ckrm-tech, efocht, LSE Tech, hch, steiner, Jesse Barnes,
sylvain.jeaugey, djh, LKML, Simon.Derr, Andi Kleen, sivanich
On Tue, 2004-10-05 at 02:17, Paul Jackson wrote:
> The /dev/cpuset pseudo file system api was chosen because it was
> convenient for small scale work, learning and experimentation, because
> it was a natural for the hierarchical name space with permissions that I
> required, and because it was convenient to leverage existing vfs
> structure in the kernel.
I really like the /dev/cpuset FS. I would like to leverage most of that
code to be the user level interface to creating, linking & destroying
sched_domains at some point. This, of course, is assuming that the
dynamic sched_domains concept meets with something less than catcalls
and jeers... ;)
-Matt
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-06 23:12 ` Matthew Dobson
@ 2004-10-07 8:59 ` Paul Jackson
0 siblings, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-07 8:59 UTC (permalink / raw)
To: colpatch
Cc: Simon.Derr, mbligh, pwil3058, frankeh, dipankar, akpm, ckrm-tech,
efocht, lse-tech, hch, steiner, jbarnes, sylvain.jeaugey, djh,
linux-kernel, ak, sivanich
Matthew wrote:
> > Perhaps these flags should be called:
> > mems_exclusive_precursor
> > cpus_exclusive_precursor
> > ;).
>
> Ok... So if we could offer the 'real' exclusion that the PBS and LSF
> workload managers offer directly, would that suffice? Meaning, could we
> make PBS and LSF work on top of in-kernel mechanisms that offer 'real'
> exclusion. 'Real' exclusion defined as isolated groups of CPUs and
> memory that the kernel can guarantee will not run other processes? That
> way we can get the job done without having to rely on these external
> workload managers, and be able to offer this dynamic partitioning to all
> users. Thoughts?
I agree entirely. Before when I was being a penny pincher about
how much went in the kernel, it might have made sense to have
the mems_exclusive and cpus_exclusive precursor flags.
But now that we have demonstrated a bona fide need for a really
really exclusive cpuset, it was silly of me to consider offering:
> > mems_exclusive_precursor
> > cpus_exclusive_precursor
> > really_really_exclusive
These multiple flavors just confuse and annoy.
You're right. Just one flag option, for the really exclusive cpuset,
is required here.
A different scheduler domain (whether same scheduler with awareness of
the boundaries, or something more substantially distinct) may only be
attached to a cpuset if it is exclusive.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-06 23:21 ` Matthew Dobson
@ 2004-10-07 9:41 ` Paul Jackson
0 siblings, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-07 9:41 UTC (permalink / raw)
To: colpatch
Cc: mbligh, pwil3058, frankeh, dipankar, akpm, ckrm-tech, efocht,
lse-tech, hch, steiner, jbarnes, sylvain.jeaugey, djh,
linux-kernel, Simon.Derr, ak, sivanich
Matt wrote:
> I'm really glad to hear that, Paul. That unconstrained (ab)use was my
> only real concern with the cpusets patches. I look forward to massaging
> our two approaches into something that will satisfy all interested
> parties.
Sounds good.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-07 19:52 ` Paul Jackson
@ 2004-10-07 21:04 ` Matthew Helsley
0 siblings, 0 replies; 53+ messages in thread
From: Matthew Helsley @ 2004-10-07 21:04 UTC (permalink / raw)
To: Paul Jackson; +Cc: CKRM-Tech
On Thu, 2004-10-07 at 12:52, Paul Jackson wrote:
<snip>
> Also the other means to poke the affinity masks, sched_setaffinity,
> mbind and set_mempolicy, need to be constrained to respect cpuset
> boundaries and honor exclusion. I doubt you want them calling out to a
> user daemon either.
>
> And the memory affinity mask, mems_allowed, seems to require updating
> within the current task context. Perhaps someone else is smart enough
> to see an alternative, but I could not find a safe way to update this
> from outside the current context. So it's updated on the path going
> into __alloc_pages(). I doubt you want a patch that calls out to my
> daemon on each call into __alloc_pages().
<snip>
Just a thought: could a system-wide ld preload of some form be useful
here? You could use preload to add wrappers around the necessary calls
(you'd probably want to do this in /etc/ld.so.preload). Then have those
wrappers communicate with a daemon or open some /etc config files that
describe the topology you wish to enforce.
Cheers,
-Matt Helsley
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-07 19:05 ` Rick Lindsley
@ 2004-10-10 2:15 ` Paul Jackson
2004-10-11 22:06 ` Matthew Dobson
2004-10-10 2:28 ` Paul Jackson
1 sibling, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2004-10-10 2:15 UTC (permalink / raw)
To: Rick Lindsley
Cc: colpatch, mbligh, Simon.Derr, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, ak, sivanich
Rick replying to Paul:
> > But doesn't CKRM provide a way to control what percentage of the
> > compute cycles are available from a pool of cycles?
> >
> > And don't cpusets provide a way to control which physical CPUs a
> > task can or cannot use?
>
> Right.
I am learning (see other messages of the last couple days on this
thread) that CKRM is supposed to be a general purpose workload manager
framework, and that fair share scheduling (managing percentage of
compute cycles) just happens to be the first instance of such a manager.
> And what I'm hearing is that if you're a job running in a set of shared
> resources (i.e., non-exclusive) then by definition you are *not* a job
> who cares about which processor you run on. I can't think of a situation
> where I'd care about the physical locality, and the proximity of memory
> and other nodes, but NOT care that other tasks might steal my cycles.
There are at least these situations:
1) proximity to special hardware (graphics, networking, storage, ...)
2) non-dedicated tightly coupled multi-threaded apps (OpenMP, MPI)
3) batch managers switching resources between jobs
On (2), say you want to run eight copies of an application, on a
system that only has eight CPUs, where each copy of the app is an
eight-way tightly coupled app, they will go much faster if each app is
placed across all 8 CPUs, one thread per CPU, than if they are placed
willy-nilly. Or a bit more realistically, if you have a random input
queue of such tightly coupled apps, each with a predetermined number of
threads between one and eight, you will get more work done by pinning
the threads of any given app on different CPUs. The users submitting
the jobs may well not care which CPUs are used for their job, but an
intermediate batch manager probably will care, as it may be solving the
knapsack problem of how to fit a stream of varying sized jobs onto a
given size of hardware.
On (3), a batch manager might, say, have two small cpusets, and also one
larger cpuset that is the two small ones combined. It might run one job
in each of the two small cpusets for a while, then suspend these two
jobs, in order to run a third job in the larger cpuset. The two small
cpusets don't go away while the third job runs -- you don't want to lose
or have to tear down and rebuild the detailed inter-cpuset placement of
the two small jobs while they are suspended.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-07 19:05 ` Rick Lindsley
2004-10-10 2:15 ` [ckrm-tech] " Paul Jackson
@ 2004-10-10 2:28 ` Paul Jackson
1 sibling, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-10 2:28 UTC (permalink / raw)
To: Rick Lindsley
Cc: colpatch, mbligh, Simon.Derr, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, ak, sivanich
Rick wrote:
> One does? No, in my world, there's constant auditing going on and if
> you can get away with having a machine idle, power to ya, but chances
> are somebody's going to come and take away at least the cycles and maybe
I don't doubt that such worlds as yours exist, nor that you live in one.
In some of the worlds my customers live in, they have been hit so many
times with the pains of performance degradation and variation due to
unwanted interaction between applications that they get nervous if a
supposedly unused CPU or Memory looks to be in use. Just the common use
by Linux of unused memory to keep old pages in cache upsets them.
And, perhaps more to the point, while indeed some other department may
soon show up to make use of those lost cycles, the computer had jolly
well better leave those cycles lost _until_ the customer decides to use
them.
Unlike the computer in my dentist's office, which should "just do it",
maximizing throughput as best it can, not asking any questions, the
computers in some of my customers' high-end shops are managed more tightly
(sometimes very tightly) and they expect to control load placement.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-07 18:25 ` Andrew Morton
2004-10-07 19:52 ` Paul Jackson
@ 2004-10-10 3:22 ` Paul Jackson
1 sibling, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-10 3:22 UTC (permalink / raw)
To: Andrew Morton
Cc: mbligh, Simon.Derr, colpatch, pwil3058, frankeh, dipankar,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, ak, sivanich
Andrew wrote:
> As you say, it's a matter of coordinated poking at cpus_allowed.
No - I said I concluded that three years ago. And then later learned
the hard way this wasn't enough.
See further my earlier (like 2.5 days and 2 boxes of Kleenex ago) reply
to this same post.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-07 14:49 ` Martin J. Bligh
2004-10-07 17:54 ` Paul Jackson
@ 2004-10-10 5:12 ` Paul Jackson
1 sibling, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2004-10-10 5:12 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Simon.Derr, colpatch, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, ak, sivanich
> That makes no sense to me whatsoever, I'm afraid. Why if they were allowed
> "to steal a few cycles" are they so fervently banned from being in there?
One substantial advantage of cpusets (as in the kernel patch in *-mm's
tree), over variations that "just poke the affinity masks from user
space" is the task->cpuset pointer. This tracks to what cpuset a task
is attached. The fork and exit code duplicates and nukes this pointer,
managing the cpuset reference counter.
It matters to batch schedulers and the like which cpuset a task is in,
and which tasks are in a cpuset, when it comes time to do things like
suspend or migrate the tasks currently in a cpuset.
Just because it's ok to share a little compute time in a cpuset doesn't
mean you don't care to know who is in it.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-10 2:15 ` [ckrm-tech] " Paul Jackson
@ 2004-10-11 22:06 ` Matthew Dobson
2004-10-11 22:58 ` Paul Jackson
2004-10-12 8:50 ` Simon Derr
0 siblings, 2 replies; 53+ messages in thread
From: Matthew Dobson @ 2004-10-11 22:06 UTC (permalink / raw)
To: Paul Jackson
Cc: Rick Lindsley, Martin J. Bligh, Simon.Derr, pwil3058, frankeh,
dipankar, Andrew Morton, ckrm-tech, efocht, LSE Tech, hch,
steiner, Jesse Barnes, sylvain.jeaugey, djh, LKML, Andi Kleen,
sivanich
On Sat, 2004-10-09 at 19:15, Paul Jackson wrote:
> Rick replying to Paul:
> > And what I'm hearing is that if you're a job running in a set of shared
> > resources (i.e., non-exclusive) then by definition you are *not* a job
> > who cares about which processor you run on. I can't think of a situation
> > where I'd care about the physical locality, and the proximity of memory
> > and other nodes, but NOT care that other tasks might steal my cycles.
>
> There are at least these situations:
> 1) proximity to special hardware (graphics, networking, storage, ...)
> 2) non-dedicated tightly coupled multi-threaded apps (OpenMP, MPI)
> 3) batch managers switching resources between jobs
>
> On (2), if say you want to run eight copies of an application, on a
> system that only has eight CPUs, where each copy of the app is an
> eight-way tightly coupled app, they will go much faster if each app is
> placed across all 8 CPUs, one thread per CPU, than if they are placed
> willy-nilly. Or a bit more realistically, if you have a random input
> queue of such tightly coupled apps, each with a predetermined number of
> threads between one and eight, you will get more work done by pinning
> the threads of any given app on different CPUs. The users submitting
> the jobs may well not care which CPUs are used for their job, but an
> intermediate batch manager probably will care, as it may be solving the
> knapsack problem of how to fit a stream of varying sized jobs onto a
> given size of hardware.
>
> On (3), a batch manager might say have two small cpusets, and also one
> larger cpuset that is the two small ones combined. It might run one job
> in each of the two small cpusets for a while, then suspend these two
> jobs, in order to run a third job in the larger cpuset. The two small
> cpusets don't go away while the third job runs -- you don't want to lose
> or have to tear down and rebuild the detailed inter-cpuset placement of
> the two small jobs while they are suspended.
I think these situations, particularly the first two, are the times you
*want* to use the cpus_allowed mechanism. Pinning a specific thread to
a specific processor (cases (1) & (2)) is *exactly* why the cpus_allowed
mechanism was put into the kernel.
And (3) can pretty easily be achieved by using a combination of
sched_domains and cpus_allowed. In your example of one 4 CPU cpuset and
two 2 CPU sub cpusets (cpu-subsets? :), one could easily create a 4 CPU
domain for the larger job and two 2 CPU domains for the smaller jobs.
Those two 2-CPU subdomains can be created & destroyed at will, or they
could be simply tagged as "exclusive" when you don't want tasks moving
back and forth between them, and tagged as "non-exclusive" when you want
tasks to be freely balanced across all 4 CPUs in the larger parent
domain.
One of the cool things about using sched_domains as your partitioning
element is that in reality, tasks run on *CPUs*, not *domains*. So if
you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and
threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can
suspend threads a1, a2, b1 & b2 and remove the domains they were running
in to allow job A (big job with threads A1, A2, A3, & A4) to run on the
larger 4 CPU domain. When you then suspend A1-A4 again to allow the
smaller jobs to proceed, you can pretty trivially create the 2 CPU
domains underneath the 4 CPU domain and resume the jobs. Those jobs (a
& b) have been suspended on the CPUs they were originally running on,
and thus will resume on the same CPUs without any extra effort. They
will simply run on those CPUs, and at load balance time, the domains
attached to those CPUs will be consulted to determine where the tasks
can be relocated to if there is a heavy load. The domains will tell the
scheduler that the tasks cannot be relocated outside the 2 CPUs in each
respective domain. Viola! (sorta ;)
-Matt
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-11 22:06 ` Matthew Dobson
@ 2004-10-11 22:58 ` Paul Jackson
2004-10-12 21:22 ` Matthew Dobson
2004-10-12 8:50 ` Simon Derr
1 sibling, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2004-10-11 22:58 UTC (permalink / raw)
To: colpatch
Cc: ricklind, mbligh, Simon.Derr, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, ak, sivanich
Matthew wrote:
> > One of the cool things about using sched_domains as your partitioning
> element is that in reality, tasks run on *CPUs*, not *domains*.
Unfortunately, my manager has reminded me of an essential deliverable
that I have for another project, due in two weeks. I'm going to need
every one of those days. So I will have to take a two week sabbatical
from this design work.
It might make sense to reconvene this work on a new thread, with a last
message on this monster thread inviting all interested parties to come
on over. I suspect a few folks will be happy to see this thread wind
down.
I'd guess lse-tech (my preference) or ckrm-tech would be a suitable
forum for this new thread.
Carry on.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-11 22:06 ` Matthew Dobson
2004-10-11 22:58 ` Paul Jackson
@ 2004-10-12 8:50 ` Simon Derr
2004-10-12 21:25 ` Matthew Dobson
1 sibling, 1 reply; 53+ messages in thread
From: Simon Derr @ 2004-10-12 8:50 UTC (permalink / raw)
To: Matthew Dobson
Cc: Paul Jackson, Rick Lindsley, Martin J. Bligh, Simon.Derr,
pwil3058, frankeh, dipankar, Andrew Morton, ckrm-tech, efocht,
LSE Tech, hch, steiner, Jesse Barnes, sylvain.jeaugey, djh, LKML,
Andi Kleen, sivanich
> One of the cool things about using sched_domains as your partitioning
> element is that in reality, tasks run on *CPUs*, not *domains*. So if
> you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and
> threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can
> suspend threads a1, a2, b1 & b2 and remove the domains they were running
> in to allow job A (big job with threads A1, A2, A3, & A4) to run on the
> larger 4 CPU domain. When you then suspend A1-A4 again to allow the
> smaller jobs to proceed, you can pretty trivially create the 2 CPU
> domains underneath the 4 CPU domain and resume the jobs. Those jobs (a
> & b) have been suspended on the CPUs they were originally running on,
> and thus will resume on the same CPUs without any extra effort. They
> will simply run on those CPUs, and at load balance time, the domains
> attached to those CPUs will be consulted to determine where the tasks
> can be relocated to if there is a heavy load. The domains will tell the
> scheduler that the tasks cannot be relocated outside the 2 CPUs in each
> respective domain. Viola! (sorta ;)
Voilà ;-)
I agree that this looks really smooth from a scheduler point of view.
From a user point of view, there remains the issue of suspending the tasks:
- find which tasks to suspend: how do you know that job 'a' consists
exactly of 'a1' and 'a2'?
- suspend them (btw, how do you achieve this? kill -STOP ?)
I've been away from my mail and am still trying to catch up; never mind
if the above does not make sense to you.
Simon.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-11 22:58 ` Paul Jackson
@ 2004-10-12 21:22 ` Matthew Dobson
0 siblings, 0 replies; 53+ messages in thread
From: Matthew Dobson @ 2004-10-12 21:22 UTC (permalink / raw)
To: Paul Jackson
Cc: Rick Lindsley, Martin J. Bligh, Simon.Derr, pwil3058, frankeh,
dipankar, Andrew Morton, ckrm-tech, efocht, LSE Tech, hch,
steiner, Jesse Barnes, sylvain.jeaugey, djh, LKML, Andi Kleen,
sivanich
On Mon, 2004-10-11 at 15:58, Paul Jackson wrote:
> Matthew wrote:
> > One of the cool things about using sched_domains as your partitioning
> > element is that in reality, tasks run on *CPUs*, not *domains*.
>
> Unfortunately, my manager has reminded me of an essential deliverable
> that I have for another project, due in two weeks. I'm going to need
> every one of those days. So I will have to take a two week sabbatical
> from this design work.
>
> It might make sense to reconvene this work on a new thread, with a last
> message on this monster thread inviting all interested parties to come
> on over. I suspect a few folks will be happy to see this thread wind
> down.
>
> I'd guess lse-tech (my preference) or ckrm-tech would be a suitable
> forum for this new thread.
>
> Carry on.
Sounds good, Paul. I think the discussion on this thread was kind of
winding down anyway. In two weeks I'll have some more work done on my
code, particularly trying to get the cpusets/CKRM filesystem interface
to play with my sched_domains code. We'll be able to digest all the
information, requirements, requests, etc. on this thread and start a
fresh discussion on (or at least closer to) the same page.
-Matt
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2004-10-12 8:50 ` Simon Derr
@ 2004-10-12 21:25 ` Matthew Dobson
0 siblings, 0 replies; 53+ messages in thread
From: Matthew Dobson @ 2004-10-12 21:25 UTC (permalink / raw)
To: Simon Derr
Cc: Paul Jackson, Rick Lindsley, Martin J. Bligh, pwil3058, frankeh,
dipankar, Andrew Morton, ckrm-tech, efocht, LSE Tech, hch,
steiner, Jesse Barnes, sylvain.jeaugey, djh, LKML, Andi Kleen,
sivanich
On Tue, 2004-10-12 at 01:50, Simon Derr wrote:
> > One of the cool things about using sched_domains as your partitioning
> > element is that in reality, tasks run on *CPUs*, not *domains*. So if
> > you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and
> > threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can
> > suspend threads a1, a2, b1 & b2 and remove the domains they were running
> > in to allow job A (big job with threads A1, A2, A3, & A4) to run on the
> > larger 4 CPU domain. When you then suspend A1-A4 again to allow the
> > smaller jobs to proceed, you can pretty trivially create the 2 CPU
> > domains underneath the 4 CPU domain and resume the jobs. Those jobs (a
> > & b) have been suspended on the CPUs they were originally running on,
> > and thus will resume on the same CPUs without any extra effort. They
> > will simply run on those CPUs, and at load balance time, the domains
> > attached to those CPUs will be consulted to determine where the tasks
> > can be relocated to if there is a heavy load. The domains will tell the
> > scheduler that the tasks cannot be relocated outside the 2 CPUs in each
> > respective domain. Viola! (sorta ;)
> Voilà ;-)
hehe... My French spelling obviously isn't quite up to par. ;)
> I agree that this looks really smooth from a scheduler point of view.
>
> From a user point of view, there remains the issue of suspending the tasks:
> -find which tasks to suspend : how do you know that job 'a' consists
> exactly of 'a1' and 'a2'
> -suspend them (btw, how do you achieve this ? kill -STOP ?)
>
>
> I've been away from my mail and am still trying to catch up; never mind
> if the above does not make sense to you.
>
> Simon.
Paul didn't go into specifics about how to suspend the job, so neither
did I. Sending SIGSTOP & SIGCONT should work, as you mention... Those
are implementation details which really aren't *that* important to the
discussion. We're still trying to figure out the overall framework and
API to work with, so which method of suspending a thread we'll
eventually use can be tackled down the road. :)
-Matt
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2005-02-08 20:42 ` Paul Jackson
@ 2005-02-09 17:59 ` Chandra Seetharaman
2005-02-11 2:46 ` Chandra Seetharaman
0 siblings, 1 reply; 53+ messages in thread
From: Chandra Seetharaman @ 2005-02-09 17:59 UTC (permalink / raw)
To: Paul Jackson
Cc: Matthew Dobson, dino, mbligh, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, Simon.Derr, ak, sivanich
On Tue, Feb 08, 2005 at 12:42:34PM -0800, Paul Jackson wrote:
> Matthew wrote:
>
> I found no useful and significant basis for integration of cpusets and
> CKRM either involving CPU or Memory Node management.
>
> As best as I can figure out, CKRM is a fair share scheduler with a
> gussied up more modular architecture, so that the components to track
> usage, control (throttle) tasks, and classify tasks are separate
> plugins. I can find no significant and useful overlap on any of these
> fronts, either the existing plugins or their infrastructure, with what
> cpusets has and needs.
>
> There are claims that CKRM has some generalized resource management
> architecture that should be able to handle cpusets needs, but despite my
> repeated (albeit not entirely successful) efforts to find documentation
> and read source and my pleadings with Matthew and earlier on this
> thread, I was never able to figure out what this meant, or find anything
> that could profitably integrate with cpusets.
I thought Hubertus did talk about this the last time the thread
was active. Anyway, here is how one could do cpuset/memset under the
CKRM framework (note that I am not pitching for a marriage :) as there are
some small problems, like supporting 128 cpus and changing the parameter
names that CKRM currently uses):
First off, cpuset and memset have to be implemented as two different
controllers.
cpuset controller:
- 'guarantee' parameter to be used for representing the cpuset (bitwise)
- 'limit' parameter to be used for exclusivity and other flags.
- Highest level class (/rcfs/taskclass) will have all cpus in its list.
- Every class will maintain two cpusets: one that can be inherited,
inherit_cpuset (needed when exclusive is set in a child), and the other
for use by the class itself, my_cpuset.
- When a new class is created (under /rcfs/taskclass), it inherits all the
CPUs (from inherit_cpuset).
- Admin can change the cpuset of this class by echoing the new
cpuset (guarantee) into the 'shares' file.
- Admin can set/change the exclusivity (and similar) flags by echoing the
value (limit) to the 'shares' file.
- When the exclusivity flag is set in a class, the cpuset bits of this class
will be cleared in the inherit_cpuset of the parent and of all its other
children.
- At the time of scheduling, the my_cpuset in the class of the task will be
consulted.
The memset controller would be similar to this; before pitching it I will
talk with Matt about why he thought that there is a problem.
If I missed some feature of cpuset that shows a bigger problem, please
let me know.
>
> In sum -- I see a potential for useful integration of cpusets and
> scheduler domains, I'll have to leave it up to those with expertise in
> the scheduler to evaluate and perhaps accomplish this. I do not see any
> useful integration of cpusets and CKRM.
>
> I continue to be befuddled as to why, Matthew, you confound potential
> cpuset-scheddomain integration with potential cpuset-CKRM integration.
> Scheduler domains and CKRM are distinct beasts, in my book, and the
> contemplations of cpuset integration with these two beasts are also
> distinct efforts.
>
> And cpusets and CKRM are distinct beasts.
>
> But I repeat myself ...
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
>
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> ckrm-tech mailing list
> https://lists.sourceforge.net/lists/listinfo/ckrm-tech
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2005-02-09 17:59 ` [ckrm-tech] " Chandra Seetharaman
@ 2005-02-11 2:46 ` Chandra Seetharaman
2005-02-11 9:21 ` Paul Jackson
2005-02-11 16:54 ` Jesse Barnes
0 siblings, 2 replies; 53+ messages in thread
From: Chandra Seetharaman @ 2005-02-11 2:46 UTC (permalink / raw)
To: Paul Jackson
Cc: Matthew Dobson, dino, mbligh, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, Simon.Derr, ak, sivanich
On Wed, Feb 09, 2005 at 09:59:28AM -0800, Chandra Seetharaman wrote:
> On Tue, Feb 08, 2005 at 12:42:34PM -0800, Paul Jackson wrote:
--stuff deleted---
> memset_controller would be similar to this, before pitching it I will talk
> with Matt about why he thought that there is a problem.
I talked to Matt Dobson and explained to him the CKRM architecture and how
cpuset/memset can be implemented as a CKRM controller. He is now convinced
that there is no problem in making memset a CKRM controller as well.
As explained in the earlier mail, memset can also be implemented in the
same way as cpuset.
>
> If I missed some feature of cpuset that shows a bigger problem, please
> let me know.
> >
> > In sum -- I see a potential for useful integration of cpusets and
> > scheduler domains, I'll have to leave it up to those with expertise in
> > the scheduler to evaluate and perhaps accomplish this. I do not see any
> > useful integration of cpusets and CKRM.
> >
> > I continue to be befuddled as to why, Matthew, you confound potential
> > cpuset-scheddomain integration with potential cpuset-CKRM integration.
> > Scheduler domains and CKRM are distinct beasts, in my book, and the
> > contemplations of cpuset integration with these two beasts are also
> > distinct efforts.
> >
> > And cpusets and CKRM are distinct beasts.
> >
> > But I repeat myself ...
> >
> > --
> > I won't rest till it's the best ...
> > Programmer, Linux Scalability
> > Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
> >
> >
>
> --
>
> ----------------------------------------------------------------------
> Chandra Seetharaman | Be careful what you choose....
> - sekharan@us.ibm.com | .......you may get it.
> ----------------------------------------------------------------------
>
>
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2005-02-11 2:46 ` Chandra Seetharaman
@ 2005-02-11 9:21 ` Paul Jackson
2005-02-12 1:37 ` Chandra Seetharaman
2005-02-11 16:54 ` Jesse Barnes
1 sibling, 1 reply; 53+ messages in thread
From: Paul Jackson @ 2005-02-11 9:21 UTC (permalink / raw)
To: Chandra Seetharaman
Cc: colpatch, dino, mbligh, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, Simon.Derr, ak, sivanich
[ For those who have already reached a conclusion on this
subject, there is little that is new below. It's just
cast in a different light, as an analysis of how well
the CKRM cpuset/memset task class that Chandra describes
meets the needs of cpusets. The conclusion is: not well.
A pickup truck and a motorcycle both have their uses.
It's just difficult to combine them in a useful fashion.
Feel free to skim or skip the rest of this message. -pj ]
Chandra writes:
> If I missed some feature of cpuset that shows a bigger problem, please
> let me know.
Perhaps it would be better if first you ask yourself what
features your cpuset/memset taskclasses provide beyond
what's available in the basic sched_setaffinity (for cpu)
and mbind/set_mempolicy (for memory) calls. Offhand, I don't
see any.
But, I will grant, with my apologies, that I wrote the above
more in irritation than in a sincere effort to explain.
So, let me come at this through another door.
Since it seems apparent by now that both numa placement and
workload management cause some form of mutually exclusive brain
damage to its practitioners, making it difficult for either to
understand the other, let me:
1) describe the important properties of cpusets,
2) examine how well your proposal provides such, and
3) examine its additional costs compared to cpusets.
1. The important properties of cpusets.
=======================================
Cpusets facilitate integrated processor and memory placement
of jobs on large systems, especially useful on numa systems,
where the co-ordinated placement of jobs on cpus and memory is
important, sometimes critical, to obtaining good performance.
It is becoming increasingly obvious, as Intel, IBM and AMD
push more and more cores into one package at one end, and as
NEC, IBM, Bull, SGI and others push more and more packages into
single image systems at the other end, that complex layered numa
topologies are here to stay, in increasing number and complexity.
Cpusets helps manage numa placement of jobs in a way that
numa folks seem to find makes sense. The names of key
interface elements, and the opening remarks in commentary and
documentation are specific and relevant to the needs of those
doing numa placement.
It does so with a minimal, low cost patch in the main kernel.
Running diffstat on the cpuset* patches in 2.6.11-rc1-mm2 shows
the following summary stats:
19 files changed, 2362 insertions(+), 253 deletions(-)
The runtime costs are nearly zero, consisting in the usual
case on any hot paths of a usage counter increment at fork, a
usage counter decrement at exit, a usually inconsequential
bitmask test in mm/page_alloc.c, and a generation number
check in the mm/mempolicy.c alloc_page_vma() wrapper to
__alloc_pages().
Cpusets handles any number of CPUs and Memory Nodes, with no
practical hard limit imposed by the API or data types.
Cpusets can be used in combination with a workload manager
such as CKRM. You can use cpusets to create "soft partitions"
that are subsets of the entire system, and then in each such
partition, you can run a separate instance of a workload manager
to obtain the desired resource sharing.
Cpusets may provide a practical API to support administrative
refinements of scheduler domains, along more optimal natural
job boundaries, instead of just along automatic, artificial
architecture boundaries. Matthew and Nick both seem to be
making mumblings in this direction, but the jury is still out.
Indeed, we're still investigating. I have not heard of anyone
proposing to integrate CKRM and sched domains in this manner,
nor do I expect to.
There is no reason to artificially limit the depth of the cpuset
hierarchy, which represents subsets of subsets of cpus and nodes.
The rules (invariants) of cpusets have been carefully chosen
so as to never require any global or wide ranging analysis of
the cpuset hierarchy in order to enforce. Each child must be
a subset of its parent, and exclusive cpusets cannot overlap
their siblings. That's about it. Both rules can be evaluated
locally, using just the nearest relatives of an affected cpuset.
An essential feature of the cpuset proposal is its file system
model of the 'nested subsets of cpus and nodes'. This provides
a name space, and permission model, that supports sensible
administration of numa friendly subsets of the compute resources
of large systems in complex administration environments.
A system can be dynamically 'partitioned' and 'sub-partitioned',
with sensible names and permissions for the partitions, while
maintaining the benefits of a single system image. This is
a classic use of a kernel, to manage a system wide resource
with a name space, structure rules, resource attributes, and
a permission/access model.
In sum, cpusets provides substantial benefit past the individual
sched_setaffinity/mbind/set_mempolicy calls for managing the
numa placement of jobs on large systems, at modest cost in
code size, runtime, maintenance and intellectual mastery.
2. How much of the above does your proposal provide?
====================================================
Not much. As best as I can tell, it provides an alternative
to the existing numa cpu and memory calls, at the cost of
considerable code, complexity and obtuseness above and beyond
cpusets. That additional complexity may well be necessary,
for the more difficult job it is trying to accomplish. But it
is not necessary for the simpler task of numa placement of jobs
on named, controlled, subsets of cpus and memory nodes.
Your proposal doesn't provide a distinguished "numa computation
unit" (cpu + memory), but rather tends to lose those two elements
in a longer list of task class elements.
I can't tell if it's just because you didn't take much time to
study cpusets, or if it's due to more essential limitations
of the CKRM implementation, but you got the subsetting and
exclusive rules wrong (or at least different).
The CKRM documentation and the names of key flags and such are
not intuitive to those doing numa work. If one comes at CKRM
from the perspective of someone trying to solve a numa placement
problem, the interfaces, documentation and naming really don't
make sense. Even if your architecture is more general and
powerful, I suspect your presentation is not widely accessible
outside those with a workload focus. Or perhaps I'm just more
dimwitted than most. It's difficult for me to know which.
But certainly both Matthew and I have struggled to make sense
of CKRM from a numa perspective.
You state you'd have a 128 CPU limitation. I don't know why
that would be, but it would be a critical limitation for SGI --
no small problem.
As explained below, with your proposal, one could not readily do
both workload management and numa placement at the same time,
because the task class hierarchy needed for the two is not
the same.
As noted above, while there seems to be a decent chance that
cpusets will provide some benefit to scheduler domains, allowing
the option of organizing sched domains along actual job usage
lines instead of artificial architecture lines, I have seen
no suggestion that CKRM task classes have that potential to
improve sched domains.
Elsewhere I recall you've had to impose fairly modest bounds
on the depth of your class hierarchy, because your resource
balancing rules are expensive to evaluate across deep, large
trees. The cpuset hierarchy has no such restraint.
Your task class hierarchy, if hijacked for numa placement,
might provide the kernel managed naming, structure and
access control of dynamic (soft) numa partitions that cpusets
does. I haven't looked closely at the permission model of
CKRM to see if it matches the needs of cpusets, so I can't
speak to that detail.
In sum, your cpuset/memset CKRM proposal provides few, if any,
of the additional benefits to numa placement work that cpusets
provides over the existing affinity and numa system calls.
3. What are the additional costs of your proposal over cpusets?
===============================================================
Your proposal, while it seems to offer little advantage for
numa placement to what we already have without cpusets, comes
at substantially greater cost than cpusets.
The CKRM patch is five times the size of the cpuset patch,
with diffstat on the ckrm-e17.2610.patch showing:
65 files changed, 13020 insertions(+), 19 deletions(-)
The CKRM runtime, from what I can tell on the lmbench slide
from OLS 2004, costs several percent of available cycles.
You propose to include the cpu/mem placement hierarchy in the
task class hierarchy. This presents difficulties. Essentially,
they are not the same hierarchies. A job's placement is
independent of its priority. Both high and low priority jobs
may well require proper numa placement, and both high and low
priority tasks may well run within the same cpuset.
So if your task class hierarchy is hijacked for numa placement,
it will not serve you well for workload management. On a system
that required numa placement using something like cpusets, the
five times larger size of the kernel patch required for CKRM
would be entirely unjustified, as CKRM would only be usable
for its cpuset-like capabilities.
Much of what you have now in CKRM would be useless for cpuset
work. As you observed in your proposal, you would need new
cpuset related rules for the subset and exclusive properties.
Cpusets needs no scheduler hook at all - only the existing
cpus_allowed check that Ingo added years ago.
You propose having the scheduler check the appropriate cpu mask
in the task class, which would definitely increase the cache
footprint size of the scheduler.
The papers for CKRM speak of providing policy driven
classification and differentiated service. The focus is on
managing resource sharing, to allow different classes of tasks
to get controlled allocations of proportions of shared resources.
Cpusets is not about sharing proportions of a common resource,
but rather about dedicating entire resources. Granted,
mathematically, there might be a mapping between these two.
But it is certainly an impediment to those having to understand
something if it is implemented by abusing something far
larger and quite foreign in intention.
This flows through to the names of the specific files in the
directory representing a cpuset or class. The names for CKRM
class directories are necessarily rather generic and abstract,
whereas those for cpusets directly represent the particular
need of placing tasks on cpus and memory nodes. For someone
doing numa placement, the latter are much easier to understand.
And as noted above, since you can't do both at the same time
(both use the CKRM infrastructure for its traditional workload
management and use it for numa placement) it's not like the
administrator of such a system gains any from the more abstract
names, if they are just using it for cpusets (numa placement).
There is no synergy in the kernel hooks required in the scheduler
and memory allocator. The hooks required by cpusets check
bitmasks in order to allow or prohibit scheduling a task on
a CPU, or allocating a page from a particular node to a task.
These are quite distinct from the hooks required by CKRM when
used as a fair share scheduler and workload manager, which
requires adding delays to tasks in order to obtain the desired
proportion of resource usage between classes. Similarly, the
CKRM memory allocator hooks manage the number of pages in use
by each task class and/or the rate of page faults, while the
cpuset memory allocator hooks manage which memory nodes are
available to satisfy an allocation request.
The share usage hooks that monitor each resource, and its usage
by each class, are useless for cpusets, which has no dependency
on resource usage. In cpusets, a task can use as much of its
allowed CPUs and Memory Nodes, without throttling. There is
no feedback loop based on rates of resource usage per class.
Most of the hooks required by the CKRM classification engine to
check for possible changes in a task's class, such as in fork,
exec, setuid, listen, and other points where a kernel object
might change are not needed for cpusets. The cpuset patch only
requires such state change hooks in fork, exit and allocation,
and only requires to increment or decrement a usage count in
the fork and exit, and check a generation number in allocation.
Cpusets has no use for a kernel classification engine. Outside
of the trivial, automatic propagation of cpusets in fork and
exit, the only changes in cpusets are mandated from user space.
Nor do cpusets have any need for the kernel to support externally
defined policy rules. Cpusets has no use for the classification
engine's callback mechanism. In cpusets, no events that might
affect state, such as fork, exit, reclassifications, changes in
uid, or resource rate usage samples, need to be reported to any
state agent, and there is no state agent, nor any communication
channel thereto.
Cpusets has no use for a facility that lets server tasks tell
some external classifier what phase they are operating in.
Cpusets has no need for some workload manager to be sampling
resource consumption and task state to determine resource
consumption. Cpusets has no need to track, in user space or
kernel, the state of tasks after they exit. Cpusets has no use
for delays nor for tracking them in the task struct.
Cpusets has no need for the hooks at the entry to, and exit from,
memory allocation routines to distinguish delays due to memory
allocation from those due to application i/o. Cpusets has no
need for sampling task state at fixed intervals, and our big
iron scientific customers would without a doubt not tolerate a
scan of the entire set of tasks every second for such resource
and task state data collection. Such a scan does _not_ scale
well on big honkin numa boxes. Whereas CKRM requires something
like relayfs to pass back to user space the constant stream of
such data, cpusets has no such needs and no such data.
Certainly, none of the network hooks that CKRM requires to
provide differentiated service across priority classes would be
of any use in a system (ab)using CKRM to provide cpuset style
numa placement.
It is true that both cpusets and CKRM make good use of the Linux
kernel's virtual file system (vfs). Cpusets uses vfs to model
the hierarchy of 'soft partitions' in the system. CKRM uses vfs
to model a resource priority hierarchy, essentially replacing a
single 'task priority' with hierarchical resource allocations,
managing what proportion, out of what is available, of fungible
resources such as ticks, cycles, bytes or data transfers a
given class of tasks is allowed to use in the aggregate.
Just because two facilities use vfs is certainly not sufficient
basis for deciding that they should be combined into one
facility.
The shares and stats control files in each task_class
directory are not needed by cpusets, but new control files,
for cpus_allowed and mems_allowed are needed. That, or the
existing names have to be overloaded, at the cost of obfuscating
the interface.
The kernel hooks for cpusets are fewer, simpler and more specific
than those for CKRM. Our high performance customers would want
the cpuset hooks compiled in, not the more generic ones for
CKRM (which they could not easily use for any other workload
management purpose anyway, if the task class hierarchy were
hijacked for the needs of cpusets, as noted above).
The development costs of cpusets so far, which are perhaps the
best predictor we have of future costs, have been substantially
lower than they have been for CKRM.
In sum, your proposal costs a lot more than cpusets, by a variety
of metrics.
=================================================
In summary, I find that your cpuset/memset CKRM proposal provides
little or no benefit past the simpler cpu and memory placement
calls already available, while costing substantially more in
a variety of ways than my cpuset proposal, when evaluated for
its usefulness for numa placement.
(Of course, if evaluated for suitability for workload management,
the table is turned, and your CKRM patch provides essential
capability that my cpuset patch could never dream of doing.)
Moreover, the additional workload management benefits that your
CKRM facility provides, and that some of my customers might
want to use in combination with numa placement, would probably
become unavailable to them if we integrated cpusets and CKRM,
because cpusets would have to hijack the task class hierarchy
for its own nefarious purposes.
Such an attempt to integrate cpusets and CKRM would be a major
setback for cpusets, substantially increasing its costs and
reducing its value, probably well past the point of it even being
worth pursuing further, in the mainstream kernel. Adding all
that foreign logic of cpusets to the CKRM patch probably
wouldn't help CKRM much either. The CKRM patch is already one
that requires a bright mind and some careful thought to master.
Adding cpuset numa placement logic, which is typically different
in detail, would add a complexity burden to the CKRM code that
would serve no one well.
> Note that I am not pitching for a marriage
We agree.
I just took more words to say it ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2005-02-11 2:46 ` Chandra Seetharaman
2005-02-11 9:21 ` Paul Jackson
@ 2005-02-11 16:54 ` Jesse Barnes
2005-02-11 18:42 ` Chandra Seetharaman
1 sibling, 1 reply; 53+ messages in thread
From: Jesse Barnes @ 2005-02-11 16:54 UTC (permalink / raw)
To: Chandra Seetharaman
Cc: Paul Jackson, Matthew Dobson, dino, mbligh, pwil3058, frankeh,
dipankar, akpm, ckrm-tech, efocht, lse-tech, hch, steiner,
sylvain.jeaugey, djh, linux-kernel, Simon.Derr, ak, sivanich
On Thursday, February 10, 2005 6:46 pm, Chandra Seetharaman wrote:
> On Wed, Feb 09, 2005 at 09:59:28AM -0800, Chandra Seetharaman wrote:
> > On Tue, Feb 08, 2005 at 12:42:34PM -0800, Paul Jackson wrote:
>
> --stuff deleted---
>
> > memset_controller would be similar to this, before pitching it I will
> > talk with Matt about why he thought that there is a problem.
>
> Talked to Matt Dobson and explained to him the CKRM architecture and how
> cpuset/memset can be implemented as a ckrm controller. He is now convinced
> that there is no problem in making memset also a ckrm controller.
>
> As explained in the earlier mail, memset also can be implemented in the
> same way as cpuset.
Arg! Look, cpusets is *done* (i.e. it works well) and relatively simple and
easy to use. It's also been in -mm for quite some time. It also solves the
problem of being able to deal with large jobs on large systems rather
elegantly. Why oppose its inclusion upstream?
CKRM seems nice, but why is it not in -mm? I've heard it talked about a lot,
but it usually comes up as a response to some other, simpler project, in the
vein of "ckrm can do this, so your project is not needed" and needless to say
that's a bit frustrating. I'm not saying that ckrm isn't useful--indeed it
seems like an idea with a lot of utility (I liked Rik's ideas for using it to
manage desktop boxes and multiuser systems as a sort of per-process rlimits
on steroids), but using it for system partitioning or systemwide accounting
seems a bit foolish to me...
Jesse
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2005-02-11 16:54 ` Jesse Barnes
@ 2005-02-11 18:42 ` Chandra Seetharaman
2005-02-11 18:50 ` Jesse Barnes
0 siblings, 1 reply; 53+ messages in thread
From: Chandra Seetharaman @ 2005-02-11 18:42 UTC (permalink / raw)
To: Jesse Barnes
Cc: Paul Jackson, Matthew Dobson, dino, mbligh, pwil3058, frankeh,
dipankar, akpm, ckrm-tech, efocht, lse-tech, hch, steiner,
sylvain.jeaugey, djh, linux-kernel, Simon.Derr, ak, sivanich
On Fri, Feb 11, 2005 at 08:54:52AM -0800, Jesse Barnes wrote:
> On Thursday, February 10, 2005 6:46 pm, Chandra Seetharaman wrote:
> > On Wed, Feb 09, 2005 at 09:59:28AM -0800, Chandra Seetharaman wrote:
> > > On Tue, Feb 08, 2005 at 12:42:34PM -0800, Paul Jackson wrote:
> >
> > --stuff deleted---
> >
> > > memset_controller would be similar to this, before pitching it I will
> > > talk with Matt about why he thought that there is a problem.
> >
> > Talked to Matt Dobson and explained to him the CKRM architecture and how
> > cpuset/memset can be implemented as a ckrm controller. He is now convinced
> > that there is no problem in making memset also a ckrm controller.
> >
> > As explained in the earlier mail, memset also can be implemented in the
> > same way as cpuset.
>
> Arg! Look, cpusets is *done* (i.e. it works well) and relatively simple and
> easy to use. It's also been in -mm for quite some time. It also solves the
> problem of being able to deal with large jobs on large systems rather
> elegantly. Why oppose its inclusion upstream?
Jesse,
Do note that I did not oppose the cpuset inclusion (by saying, "I am not
pitching for a marriage"), and here are the reasons:
1. Even though cpuset can be implemented under ckrm, currently the cpu
controller and mem controller (in ckrm) cannot handle the isolating part of
the cpuset stuff cleanly and provide the resource management capabilities
ckrm is supposed to provide. For that reason, one cannot expect both the
cpuset and ckrm functionality in the same kernel.
2. I doubt that users who need cpuset will need the resource management
capabilities ckrm provides.
My email was intended mainly to erase the notion that ckrm cannot handle
cpuset. Also, I wanted to understand whether there are any real issues, which
is why I talked with Matt about why he thought ckrm cannot accommodate memset
before sending the second piece of mail.
>
> CKRM seems nice, but why is it not in -mm? I've heard it talked about a lot,
> but it usually comes up as a response to some other, simpler project, in the
We did post to lkml a while back and got comments on it. We are working on it
and will post the fixed code again in a few weeks with a couple of controllers.
> vein of "ckrm can do this, so your project is not needed" and needless to say
> that's a bit frustrating. I'm not saying that ckrm isn't useful--indeed it
> seems like an idea with a lot of utility (I liked Rik's ideas for using it to
> manage desktop boxes and multiuser systems as a sort of per-process rlimits
> on steroids), but using it for system partitioning or systemwide accounting
> seems a bit foolish to me...
>
> Jesse
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2005-02-11 18:42 ` Chandra Seetharaman
@ 2005-02-11 18:50 ` Jesse Barnes
0 siblings, 0 replies; 53+ messages in thread
From: Jesse Barnes @ 2005-02-11 18:50 UTC (permalink / raw)
To: Chandra Seetharaman
Cc: Paul Jackson, Matthew Dobson, dino, mbligh, pwil3058, frankeh,
dipankar, akpm, ckrm-tech, efocht, lse-tech, steiner,
sylvain.jeaugey, djh, linux-kernel, Simon.Derr, sivanich
On Friday, February 11, 2005 10:42 am, Chandra Seetharaman wrote:
> My email was intended mainly to erase the notion that ckrm cannot handle
> cpuset. Also, I wanted to understand whether there are any real issues, which
> is why I talked with Matt about why he thought ckrm cannot accommodate
> memset before sending the second piece of mail.
Great! So cpusets is good to go for the mainline then (i.e. no major
objections to the interface). Note that implementation details that don't
affect the interface are another subject entirely, e.g. the sched domains
approach for scheduling as opposed to cpus_allowed.
> > CKRM seems nice, but why is it not in -mm? I've heard it talked about a
> > lot, but it usually comes up as a response to some other, simpler
> > project, in the
>
> We did post to lkml a while back and got comments on it. We are working on
> it and will post the fixed code again in a few weeks with a couple of
> controllers.
Excellent, I hope that it comes together into a form suitable for the
mainline, I think there are some really nice aspects to it.
Thanks,
Jesse
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2005-02-11 9:21 ` Paul Jackson
@ 2005-02-12 1:37 ` Chandra Seetharaman
2005-02-12 6:16 ` Paul Jackson
0 siblings, 1 reply; 53+ messages in thread
From: Chandra Seetharaman @ 2005-02-12 1:37 UTC (permalink / raw)
To: Paul Jackson
Cc: colpatch, dino, mbligh, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, Simon.Derr, ak, sivanich
On Fri, Feb 11, 2005 at 01:21:12AM -0800, Paul Jackson wrote:
> [ For those who have already reached a conclusion on this
> subject, there is little that is new below. It's just
> cast in a different light, as an analysis of how well
> the CKRM cpuset/memset task class that Chandra describes
> meets the needs of cpusets. The conclusion is: not well.
>
> A pickup truck and a motorcycle both have their uses.
> It's just difficult to combine them in a useful fashion.
>
> Feel free to skim or skip the rest of this message. -pj ]
>
[ As I replied in an earlier mail, I am not advocating for cpuset to be
a ckrm controller. In this mail I am just providing clarifications
for some of Paul's comments. -chandra ]
>
> Chandra writes:
> > If I missed some feature of cpuset that shows a bigger problem, please
> > let me know.
>
> Perhaps it would be better if first you ask yourself what
> features your cpuset/memset taskclasses provide beyond
First off, I wasn't pitching for 'our' cpuset/memset taskclass. I was
suggesting that 'your' cpuset can be a ckrm controller.
> what's available in the basic sched_setaffinity (for cpu)
> and mbind/set_mempolicy (for memory) calls. Offhand, I don't
> see any.
And it doesn't have to be the same as what the above functions provide.
cpuset can function exactly the same way under ckrm as it does otherwise.
>
> But, I will grant, with my apologies, that I wrote the above
> more in irritation than in a sincere effort to explain.
>
> So, let me come at this through another door.
>
> Since it seems apparent by now that both numa placement and
> workload management cause some form of mutually exclusive brain
> damage to its practitioners, making it difficult for either to
> understand the other, let me:
> 1) describe the important properties of cpusets,
> 2) examine how well your proposal provides such, and
> 3) examine its additional costs compared to cpusets.
>
> 1. The important properties of cpusets.
> =======================================
>
> Cpusets facilitate integrated processor and memory placement
> of jobs on large systems, especially useful on numa systems,
> where the co-ordinated placement of jobs on cpus and memory is
> important, sometimes critical, to obtaining good performance.
>
> It is becoming increasingly obvious, as Intel, IBM and AMD
> push more and more cores into one package at one end, and as
> NEC, IBM, Bull, SGI and others push more and more packages into
> single image systems at the other end, that complex layered numa
> topologies are here to stay, in increasing number and complexity.
>
> Cpusets helps manage numa placement of jobs in a way that
> numa folks seem to find makes sense. The names of key
> interface elements, and the opening remarks in commentary and
> documentation are specific and relevant to the needs of those
> doing numa placement.
>
> It does so with a minimal, low cost patch in the main kernel.
> Running diffstat on the cpuset* patches in 2.6.11-rc1-mm2 shows
> the following summary stats:
>
> 19 files changed, 2362 insertions(+), 253 deletions(-)
>
> The runtime costs are nearly zero, consisting in the usual
> case on any hot paths of a usage counter increment at fork, a
> usage counter decrement at exit, a usually inconsequential
> bitmask test in mm/page_alloc.c, and a generation number
> check in the mm/mempolicy.c alloc_page_vma() wrapper to
> __alloc_pages().
>
> Cpusets handles any number of CPUs and Memory Nodes, with no
> practical hard limit imposed by the API or data types.
>
> Cpusets can be used in combination with a workload manager
> such as CKRM. You can use cpusets to create "soft partitions"
> that are subsets of the entire system, and then in each such
> partition, you can run a separate instance of a workload manager
> to obtain the desired resource sharing.
CKRM's controllers currently may not play well with cpusets.
>
> Cpusets may provide a practical API to support administrative
> refinements of scheduler domains, along more optimal natural
> job boundaries, instead of just along automatic, artificial
> architecture boundaries. Matthew and Nick both seem to be
> making mumblings in this direction, but the jury is still out.
> Indeed, we're still investigating. I have not heard of anyone
> proposing to integrate CKRM and sched domains in this manner,
> nor do I expect to.
I haven't looked at sched_domains closely. Maybe I should, and see how we
can form a synergy.
>
> There is no reason to artificially limit the depth of the cpuset
> hierarchy, which represents subsets of subsets of cpus and nodes.
> The rules (invariants) of cpusets have been carefully chosen
> so as to never require any global or wide ranging analysis of
> the cpuset hierarchy in order to enforce. Each child must be
> a subset of its parent, and exclusive cpusets cannot overlap
> their siblings. That's about it. Both rules can be evaluated
> locally, using just the nearest relatives of an affected cpuset.
>
> An essential feature of the cpuset proposal is its file system
> model of the 'nested subsets of cpus and nodes'. This provides
> a name space, and permission model, that supports sensible
> administration of numa friendly subsets of the compute resources
> of large systems in complex administration environments.
> A system can be dynamically 'partitioned' and 'sub-partitioned',
> with sensible names and permissions for the partitions, while
> maintaining the benefits of a single system image. This is
> a classic use of a kernel, to manage a system wide resource
> with a name space, structure rules, resource attributes, and
> a permission/access model.
>
> In sum, cpusets provides substantial benefit past the individual
> sched_setaffinity/mbind/set_mempolicy calls for managing the
> numa placement of jobs on large systems, at modest cost in
> code size, runtime, maintenance and intellectual mastery.
>
>
> 2. How much of the above does your proposal provide?
> ====================================================
>
> Not much. As best as I can tell, it provides an alternative
> to the existing numa cpu and memory calls, at the cost of
> considerable code, complexity and obtuseness above and beyond
> cpusets. That additional complexity may well be necessary,
> for the more difficult job it is trying to accomplish. But it
> is not necessary for the simpler task of numa placement of jobs
> on named, controlled, subsets of cpus and memory nodes.
I was answering a different question: whether ckrm can accommodate
cpuset or not. (I'll talk about the complexity part later.)
>
> Your proposal doesn't provide a distinguished "numa computation
> unit" (cpu + memory), but rather tends to lose those two elements
> in a longer list of task class elements.
It doesn't readily provide it, but the architecture can provide it.
>
> I can't tell if it's just because you didn't take much time to
> study cpusets, or if it's due to more essential limitations
> of the CKRM implementation, but you got the subsetting and
> exclusive rules wrong (or at least different).
My understanding was that, if a class/cpuset has the exclusive flag
set, then those cpus can be used only by this cpuset and its parent,
and by no other cpuset in the system.
I did get one thing wrong: I did not realize that you do not allow
setting the exclusive flag in a cpuset if any of its siblings has
any of this cpuset's cpus. (Maybe I still didn't get it right.)
But that doesn't change what I wrote in my earlier mail,
because all these details are controller-specific and I do not see
any limitation from ckrm's point of view in this context.
>
> The CKRM documentation and the names of key flags and such are
> not intuitive to those doing numa work. If one comes at CKRM
> from the perspective of someone trying to solve a numa placement
> problem, the interfaces, documentation and naming really don't
> make sense. Even if your architecture is more general and
> powerful, I suspect your presentation is not widely accessible
> outside those with a workload focus. Or perhaps I'm just more
> dimwitted than most. It's difficult for me to know which.
> But certainly both Matthew and I have struggled to make sense
> of CKRM from a numa perspective.
I agree. The filenames are not intuitive for cpuset purposes.
>
> You state you'd have a 128 CPU limitation. I don't know why
> that would be, but it would be a critical limitation for SGI --
> no small problem.
I understand it is critical for SGI. I said it is a small problem
because it can be worked out easily.
>
> As explained below, with your proposal, one could not readily do
> both workload management and numa placement at the same time,
> because the task class hierarchy needed for the two is not
> the same.
>
> As noted above, while there seems to be a decent chance that
> cpusets will provide some benefit to scheduler domains, allowing
> the option of organizing sched domains along actual job usage
> lines instead of artificial architecture lines, I have seen
> no suggestion that CKRM task classes have that potential to
> improve sched domains.
>
> Elsewhere I recall you've had to impose fairly modest bounds
> on the depth of your class hierarchy, because your resource
> balancing rules are expensive to evaluate across deep, large
> trees. The cpuset hierarchy has no such restraint.
We put the limitation in the architecture because of controllers.
We can open it up to allow deeper hierarchy and let the controllers
decide how deep they can support.
>
> Your task class hierarchy, if hijacked for numa placement,
I wasn't suggesting that the cpuset controller hijack CKRM's task
hierarchy; I was suggesting that it play within it.
Controllers don't hijack the hierarchy. The hierarchy is only for classes;
controllers have control over only their portion of a class.
> might provide the kernel managed naming, structure and
> access control of dynamic (soft) numa partitions that cpusets
> does. I haven't looked closely at the permission model of
> CKRM to see if it matches the needs of cpusets, so I can't
> speak to that detail.
Are you talking about allowing users to manage their own class/cpusets ?
If so, we do have them.
>
> In sum, your cpuset/memset CKRM proposal provides few, if any,
> of the additional benefits to numa placement work that cpusets
> provides over the existing affinity and numa system calls.
>
>
> 3. What are the additional costs of your proposal over cpusets?
> ===============================================================
>
> Your proposal, while it seems to offer little advantage for
> numa placement to what we already have without cpusets, comes
> at a substantially greater cost than cpusets.
>
> The CKRM patch is five times the size of the cpuset patch,
> with diffstat on the ckrm-e17.2610.patch showing:
>
> 65 files changed, 13020 insertions(+), 19 deletions(-)
ckrm-e17 has the whole stack (core, rcfs, taskclass, socketclass, delay
accounting, rbce, crbce, numtasks controller and listenaq controller).
But for your purposes, or our discussion, one would need only 3 modules of
the above (core, rcfs and taskclass). I just compared it with the broken-up
patches we posted on lkml recently. The whole stack has 12227 insertions,
of which only 4554 correspond to the 3 modules listed.
>
> The CKRM runtime, from what I can tell on the lmbench slide
> from OLS 2004, costs several percent of available cycles.
The graph you see in the presentation is with the CPU controller, not
for the core CKRM. We don't have to include the CPU controller to get
cpusets working as a controller.
>
> You propose to include the cpu/mem placement hierarchy in the
> task class hierarchy. This presents difficulties. Essentially,
> they are not the same hierarchies. A jobs placement is
> independent of its priority. Both high and low priority jobs
> may well require proper numa placement, and both high and low
> priority tasks may well run within the same cpuset.
>
> So if your task class hierarchy is hijacked for numa placement,
> it will not serve you well for workload management. On a system
> that required numa placement using something like cpusets, the
> fives times larger size of the kernel patch required for CKRM
As explained above, it is not 5 times larger.
> would be entirely unjustified, as CKRM would only be usable
> for its cpuset-like capabilities.
>
> Much of what you have now in CKRM would be useless for cpuset
> work. As you observed in your proposal, you would need new
> cpuset related rules for the subset and exclusive properties.
CKRM doesn't need new rules; the subset and exclusive property handling
will be the functionality of the cpuset controller.
>
> Cpusets needs no new scheduler hook - it only needs the
> existing cpus_allowed check that Ingo already added, years ago.
> You propose having the scheduler check the appropriate cpu mask
> in the task class, which would definitely increase the cache
> footprint size of the scheduler.
Agree, one more level of indirection (instead of task->cpuset->cpus_allowed
it will be task->taskclass->res[CPUSET]->cpus_allowed).
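The extra pointer hops under discussion can be counted with a toy
layout. The struct names here are illustrative, with uint64_t standing
in for cpumask_t; this is not the layout of either patch:

```c
#include <assert.h>
#include <stdint.h>

#define CPUSET_RES 0	/* index of the cpuset resource, illustrative */

struct toy_cpuset    { uint64_t cpus_allowed; };
struct toy_res       { uint64_t cpus_allowed; };
struct toy_taskclass { struct toy_res *res[1]; };

/* today: the mask lives in the task struct itself */
struct task_plain   { uint64_t cpus_allowed; };
/* cpusets: one extra dereference on the scheduler path */
struct task_cpusets { struct toy_cpuset *cpuset; };
/* CKRM cpuset controller: two extra dereferences */
struct task_ckrm    { struct toy_taskclass *taskclass; };

static uint64_t mask_plain(const struct task_plain *t)
{
	return t->cpus_allowed;
}

static uint64_t mask_cpusets(const struct task_cpusets *t)
{
	return t->cpuset->cpus_allowed;
}

static uint64_t mask_ckrm(const struct task_ckrm *t)
{
	return t->taskclass->res[CPUSET_RES]->cpus_allowed;
}
```

Each added dereference is a potential extra cache line touched on the
scheduler's hot path, which is the cache-footprint concern raised above.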
>
> The papers for CKRM speak of providing policy driven
> classification and differentiated service. The focus is on
> managing resource sharing, to allow different classes of tasks
> to get controlled allocations of proportions of shared resources.
>
> Cpusets is not about sharing proportions of a common resource,
> but rather about dedicating entire resources. Granted,
> mathematically, there might be a mapping between these two.
> But it is certainly an impediment to those having to understand
> something, if it is implemented by abusing something quite
> larger and quite foreign in intention.
>
> This flows through to the names of the specific files in the
> directory representing a cpuset or class. The names for CKRM
> class directories are necessarily rather generic and abstract,
> whereas those for cpusets directly represent the particular
> need of placing tasks on cpus and memory nodes. For someone
> doing numa placement, the latter are much easier to understand.
>
> And as noted above, since you can't do both at the same time
> (both use the CKRM infrastructure for its traditional workload
> management and use it for numa placement) it's not like the
> administrator of such a system gains any from the more abstract
> names, if they are just using it for cpusets (numa placement).
>
> There is no synergy in the kernel hooks required in the scheduler
> and memory allocator. The hooks required by cpusets check
> bitmasks in order to allow or prohibit scheduling a task on
> a CPU, or allocating a page from a particular node to a task.
> These are quite distinct from the hooks required by CKRM when
> used as a fair share scheduler and workload manager, which
> requires adding delays to tasks in order to obtain the desired
> proportion of resource usage between classes. Similarly, the
> CKRM memory allocator hooks manage the number of pages in use
> by each task class and/or the rate of page faults, while the
> cpuset memory allocator hooks manage which memory nodes are
> available to satisfy an allocation request.
I think this is where we go tangential. When you say CKRM, you refer
to the whole stack.
When we say CKRM, we mean only the framework (core, rcfs and taskclass or
socketclass). It is the framework that enables the user to define classes
and classify tasks or sockets.
All the other modules are optional and exchangeable.
CKRM has different configurable modules, each with its own defined purpose.
One doesn't have to include a module if one doesn't need it.
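The contrast Paul draws above, a yes/no mask test on the hot path
versus usage-driven throttling, can be sketched with two hypothetical
helpers (invented names, not code from either project):

```c
#include <assert.h>
#include <stdint.h>

/* cpuset-style allocator hook: a pure permission check, no feedback
 * loop -- either the node is in mems_allowed or it is not */
static int toy_node_allowed(uint64_t mems_allowed, int node)
{
	return (int)((mems_allowed >> node) & 1);
}

/* CKRM-style share hook: compare a class's measured usage against
 * its allocated share; when over budget the caller delays the task */
static int toy_over_share(unsigned long usage, unsigned long share)
{
	return usage > share;
}
```

The first needs no accounting state at all; the second only makes
sense if usage is continuously metered per class, which is exactly the
machinery cpusets has no use for.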
>
> The share usage hooks that monitor each resource, and its usage
> by each class, are useless for cpusets, which has no dependency
> on resource usage. In cpusets, a task can use as much of its
> allowed CPUs and Memory Nodes as it likes, without throttling. There is
> no feedback loop based on rates of resource usage per class.
>
> Most of the hooks required by the CKRM classification engine to
> check for possible changes in a tasks class, such as in fork,
> exec, setuid, listen, and other points where a kernel object
> might change are not needed for cpusets. The cpuset patch only
> requires such state change hooks in fork, exit and allocation,
> and only requires to increment or decrement a usage count in
> the fork and exit, and check a generation number in allocation.
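The hooks described in this paragraph are small enough to sketch. A
toy version, with invented names rather than the actual patch code:

```c
#include <assert.h>

struct toy_cpuset {
	int count;		/* tasks attached to this cpuset */
	int mems_generation;	/* bumped whenever mems_allowed changes */
};

/* fork and exit only touch a reference count */
static void toy_cpuset_fork(struct toy_cpuset *cs) { cs->count++; }
static void toy_cpuset_exit(struct toy_cpuset *cs) { cs->count--; }

/* allocation path: refresh the task's cached view of mems_allowed
 * only when the cpuset's generation number has moved */
static int toy_cpuset_refresh(const struct toy_cpuset *cs, int *cached_gen)
{
	if (*cached_gen != cs->mems_generation) {
		*cached_gen = cs->mems_generation;
		return 1;	/* caller re-reads mems_allowed */
	}
	return 0;		/* cached mask still valid */
}
```

The generation check keeps the common allocation case to a single
integer compare, with no classification engine anywhere in sight.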
>
> Cpusets has no use for a kernel classification engine. Outside
> of the trivial, automatic propagation of cpusets in fork and
> exit, the only changes in cpusets are mandated from user space.
>
> Nor do cpusets have any need for the kernel to support externally
> defined policy rules. Cpusets has no use for the classification
> engines callback mechanism. In cpusets, no events that might
> affect state, such as fork, exit, reclassifications, changes in
> uid, or resource rate usage samples, need to be reported to any
> state agent, and there is no state agent, nor any communication
> channel thereto.
>
> Cpusets has no use for a facility that lets server tasks tell
> some external classifier what phase they are operating in.
> Cpusets has no need for some workload manager to be sampling
> resource consumption and task state to determine resource
> consumption. Cpusets has no need to track, in user space or
> kernel, the state of tasks after they exit. Cpusets has no use
> for delays nor for tracking them in the task struct.
>
> Cpusets has no need for the hooks at the entry to, and exit from,
> memory allocation routines to distinguish delays due to memory
> allocation from those due to application i/o. Cpusets has no
> need for sampling task state at fixed intervals, and our big
> iron scientific customers would without a doubt not tolerate a
> scan of the entire set of tasks every second for such resource
> and task state data collection. Such a scan does _not_ scale
> well on big honkin numa boxes. Whereas CKRM requires something
> like relayfs to pass back to user space the constant stream of
> such data, cpusets has no such needs and no such data.
>
> Certainly, none of the network hooks that CKRM requires to
> provide differentiated service across priority classes would be
> of any use in a system (ab)using CKRM to provide cpuset style
> numa placement.
With the explanations above, I think you will now agree that all
of the above comments are invalidated. Basically, you don't have to
bring those modules in if you don't need them.
>
> It is true that both cpusets and CKRM make good use of the Linux
> kernel's virtual file system (vfs). Cpusets uses vfs to model
> the hierarchy of 'soft partitions' in the system. CKRM uses vfs
> to model a resource priority hierarchy, essentially replacing a
> single 'task priority' with hierarchical resource allocations,
> managing what proportion, out of what is available, of fungible
> resources such as ticks, cycles, bytes or data transfers a
> given class of tasks is allowed to use in the aggregate.
>
> Just because two facilities use vfs is certainly not sufficient
> basis for deciding that they should be combined into one
> facility.
>
> The shares and stats control files in each task_class
> directory are not needed by cpusets, but new control files,
> for cpus_allowed and mems_allowed are needed. That, or the
> existing names have to be overloaded, at the cost of obfuscating
> the interface.
The shares file can accommodate these. But for bigger configurations we
would have to use some file-based interface.
>
> The kernel hooks for cpusets are fewer, simpler and more specific
> than those for CKRM. Our high performance customers would want
> the cpuset hooks compiled in, not the more generic ones for
> CKRM (which they could not easily use for any other workload
> management purpose anyway, if the task class hierarchy were
> hijacked for the needs of cpusets, as noted above).
>
> The development costs of cpusets so far, which are perhaps the
> best predictor we have of future costs, have been substantially
> lower than they have been for CKRM.
I think you have to compare the developmental cost of a resource
controller providing cpuset functionality, not of CKRM itself.
>
> In sum, your proposal costs a lot more than cpusets, by a variety
> of metrics.
>
> =================================================
>
> In summary, I find that your cpuset/memset CKRM proposal provides
> little or no benefit past the simpler cpu and memory placement
> calls already available, while costing substantially more in
> a variety of ways than my cpuset proposal, when evaluated for
> its usefulness for numa placement.
>
> (Of course, if evaluated for suitability for workload management,
> the table is turned, and your CKRM patch provides essential
> capability that my cpuset patch could never dream of doing.)
>
> Moreover, the additional workload management benefits that your
> CKRM facility provides, and that some of my customers might
> want to use in combination with numa placement, would probably
> become unavailable to them if we integrated cpusets and CKRM,
> because cpusets would have to hijack the task class hierarchy
> for its own nefarious purposes.
>
> Such an attempt to integrate cpusets and CKRM would be a major
> setback for cpusets, substantially increasing its costs and
> reducing its value, probably well past the point of it even being
> worth pursuing further, in the mainstream kernel. Adding all
> that foreign logic of cpusets to the CKRM patch probably
> wouldn't help CKRM much either. The CKRM patch is already one
> that requires a bright mind and some careful thought to master.
If one reads the design and then looks at the broken-down patches,
it may not be hard.
> Adding cpuset numa placement logic, which is typically different
> in detail, would add a complexity burden to the CKRM code that
> would serve no one well.
>
>
> > Note that I am not pitching for a marriage
>
> We agree.
>
> I just took more words to say it :).
The reasons we each give are very different, though. I meant that it
won't be a happy, productive marriage.
But I infer that you are suggesting that the species themselves are
different, which I do not agree with.
chandra
PS to everyone else: Wow, you have a lot of patience :)
>
>
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
>
* Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement
2005-02-12 1:37 ` Chandra Seetharaman
@ 2005-02-12 6:16 ` Paul Jackson
0 siblings, 0 replies; 53+ messages in thread
From: Paul Jackson @ 2005-02-12 6:16 UTC (permalink / raw)
To: Chandra Seetharaman
Cc: colpatch, dino, mbligh, pwil3058, frankeh, dipankar, akpm,
ckrm-tech, efocht, lse-tech, hch, steiner, jbarnes,
sylvain.jeaugey, djh, linux-kernel, Simon.Derr, ak, sivanich
I agree with 97% of what you write, Chandra.
> one more level of indirection(instead of task->cpuset->cpus_allowed
> it will be task->taskclass->res[CPUSET]->cpus_allowed).
No -- two more levels of indirection (task->cpus_allowed becomes
task->taskclass->res[CPUSET]->cpus_allowed).
> But, for your purposes or our discussions one would need only 3 modules
> of the above (core, rcfs and taskclass).
Ok. That was not obvious to me until now. If there is a section in
your documentation that explains this, and addresses the needs and
motivations of someone trying to reuse portions of CKRM in such a
manner, I missed it. Whatever ...
In any case, on the issue that matters to me right now, we agree:
> It won't be a happy, productive marriage.
Good. Thanks. Good luck to you.
> PS to everyone else: Wow, you have a lot of patience :)
For sure.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
end of thread, other threads:[~2005-02-12 6:16 UTC | newest]
Thread overview: 53+ messages
2004-10-05 6:05 [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement Stan Hoeppner
[not found] <20041007072842.2bafc320.pj@sgi.com>
2004-10-07 19:05 ` Rick Lindsley
2004-10-10 2:15 ` [ckrm-tech] " Paul Jackson
2004-10-11 22:06 ` Matthew Dobson
2004-10-11 22:58 ` Paul Jackson
2004-10-12 21:22 ` Matthew Dobson
2004-10-12 8:50 ` Simon Derr
2004-10-12 21:25 ` Matthew Dobson
2004-10-10 2:28 ` Paul Jackson
-- strict thread matches above, loose matches on Subject: below --
2004-10-04 0:45 Paul Jackson
2004-10-04 11:44 ` Rick Lindsley
2004-10-04 22:46 ` [ckrm-tech] " Paul Jackson
2004-08-05 10:08 [PATCH] new bitmap list format (for cpusets) Paul Jackson
2004-08-06 2:05 ` [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement Paul Jackson
2004-08-06 3:24 ` Martin J. Bligh
2004-08-06 15:30 ` Erich Focht
2004-08-07 6:10 ` Paul Jackson
2004-08-07 15:22 ` Erich Focht
2004-08-08 20:22 ` Shailabh Nagar
2004-08-09 15:57 ` Hubertus Franke
2004-08-10 11:31 ` [ckrm-tech] " Paul Jackson
2004-08-10 22:38 ` Shailabh Nagar
2004-08-11 10:42 ` Erich Focht
2004-08-11 14:56 ` Shailabh Nagar
2004-08-14 8:51 ` Paul Jackson
2004-08-08 19:58 ` Shailabh Nagar
2004-10-01 23:41 ` Andrew Morton
2004-10-02 6:06 ` Paul Jackson
2004-10-02 14:55 ` Dipankar Sarma
2004-10-02 16:14 ` Hubertus Franke
2004-10-02 23:21 ` Peter Williams
2004-10-02 23:44 ` Hubertus Franke
2004-10-05 3:13 ` [ckrm-tech] " Matthew Helsley
2004-10-05 8:30 ` Hubertus Franke
2004-10-05 14:20 ` Paul Jackson
2004-10-03 14:36 ` Martin J. Bligh
2004-10-03 15:39 ` Paul Jackson
2004-10-03 23:53 ` Martin J. Bligh
2004-10-04 0:02 ` Martin J. Bligh
2004-10-04 0:53 ` Paul Jackson
2004-10-04 3:56 ` Martin J. Bligh
2004-10-04 4:24 ` Paul Jackson
2004-10-04 15:03 ` Martin J. Bligh
2004-10-04 15:53 ` [ckrm-tech] " Paul Jackson
2004-10-04 18:17 ` Martin J. Bligh
2004-10-04 20:25 ` Paul Jackson
2004-10-04 22:15 ` Martin J. Bligh
2004-10-05 9:17 ` Paul Jackson
2004-10-05 10:01 ` Paul Jackson
2004-10-05 22:24 ` Matthew Dobson
2004-10-05 9:26 ` Simon Derr
2004-10-05 19:34 ` Martin J. Bligh
2004-10-06 0:28 ` Paul Jackson
2004-10-06 1:16 ` Martin J. Bligh
2004-10-06 2:08 ` Paul Jackson
2004-10-06 22:59 ` Matthew Dobson
2004-10-07 8:51 ` Paul Jackson
2004-10-07 12:47 ` Simon Derr
2004-10-07 14:49 ` Martin J. Bligh
2004-10-07 17:54 ` Paul Jackson
2004-10-07 18:25 ` Andrew Morton
2004-10-07 19:52 ` Paul Jackson
2004-10-07 21:04 ` [ckrm-tech] " Matthew Helsley
2004-10-10 3:22 ` Paul Jackson
2004-10-10 5:12 ` Paul Jackson
2004-10-05 22:33 ` Matthew Dobson
2004-10-06 3:01 ` Paul Jackson
2004-10-06 23:12 ` Matthew Dobson
2004-10-07 8:59 ` [ckrm-tech] " Paul Jackson
2004-10-05 22:19 ` Matthew Dobson
2004-10-06 2:39 ` Paul Jackson
2004-10-06 23:21 ` Matthew Dobson
2004-10-07 9:41 ` [ckrm-tech] " Paul Jackson
2005-02-07 23:59 ` Matthew Dobson
2005-02-08 9:54 ` Dinakar Guniguntala
2005-02-08 19:00 ` Matthew Dobson
2005-02-08 20:42 ` Paul Jackson
2005-02-09 17:59 ` [ckrm-tech] " Chandra Seetharaman
2005-02-11 2:46 ` Chandra Seetharaman
2005-02-11 9:21 ` Paul Jackson
2005-02-12 1:37 ` Chandra Seetharaman
2005-02-12 6:16 ` Paul Jackson
2005-02-11 16:54 ` Jesse Barnes
2005-02-11 18:42 ` Chandra Seetharaman
2005-02-11 18:50 ` Jesse Barnes
2004-10-02 15:46 ` Marc E. Fiuczynski
2004-10-02 16:17 ` Hubertus Franke
2004-10-02 17:53 ` Paul Jackson
2004-10-02 18:16 ` Hubertus Franke
2004-10-02 19:14 ` Paul Jackson
2004-10-02 23:29 ` Peter Williams
2004-10-02 23:51 ` Hubertus Franke
2004-10-02 20:40 ` Andrew Morton
2004-10-02 23:08 ` Hubertus Franke
2004-10-02 22:26 ` Alan Cox
2004-10-03 2:49 ` Paul Jackson
2004-10-03 12:19 ` Hubertus Franke
2004-10-03 3:25 ` Paul Jackson
2004-10-03 2:26 ` Paul Jackson
2004-10-03 14:11 ` Paul Jackson
2004-10-02 17:47 ` Paul Jackson