From: Dan Magenheimer <dan.magenheimer@oracle.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: lsf@lists.linux-foundation.org, linux-mm@kvack.org
Subject: RE: [Lsf] [LSF/MM TOPIC] Beyond NUMA
Date: Mon, 15 Apr 2013 08:28:46 -0700 (PDT)
Message-ID: <205ad2fe-50e6-4275-b90d-783e9f6c6984@default>
In-Reply-To: <1365975590.2359.22.camel@dabdike>
> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Subject: Re: [Lsf] [LSF/MM TOPIC] Beyond NUMA
>
> On Thu, 2013-04-11 at 17:29 -0700, Dan Magenheimer wrote:
> > MM developers and all --
> >
> > It's a bit late to add a topic, but with such a great group of brains
> > together, it seems worthwhile to spend at least some time speculating
> > on "farther-out" problems. So I propose for the MM track:
> >
> > Beyond NUMA
> >
> > NUMA now impacts even the smallest servers and, soon, perhaps even
> > embedded systems, but the performance effects are limited when the
> > number of nodes is small (e.g. two). As the number of nodes grows,
> > along with the number of memory controllers, NUMA can have a big
> > performance impact, and the MM community has invested a huge amount
> > of energy in mitigating it.
> >
> > But as the number of memory controllers grows, the cost of the system
> > grows faster. This is classic "scale-up", and certain workloads will
> > always benefit from having as many CPUs/cores and nodes as can be
> > packed into a single system. System vendors are happy to oblige
> > because the profit margin on scale-up systems can be proportionally
> > much, much larger than on smaller commodity systems. So the NUMA work
> > will always be necessary and important.
> >
> > But as scale-up grows to previously unimaginable levels, an increasing
> > fraction of workloads cannot benefit enough to compensate for the
> > non-linear increase in system cost. And so more users, especially
> > cost-sensitive ones, are turning instead to scale-out to optimize
> > cost vs. benefit for their massive data centers. Recent examples
> > include HP's Moonshot and Facebook's "Group Hug". And even major data
> > center topology changes are being proposed which use super-high-speed
> > links to separate CPUs from RAM [1].
> >
> > While filesystems and storage long ago adapted to handle large
> > numbers of servers effectively, the MM subsystem is still isolated,
> > managing its own private set of RAM, independent of and completely
> > partitioned from the RAM of other servers. Perhaps we, the Linux
> > MM developers, should start considering how MM can evolve in this
> > new world. In some ways, scale-out is like NUMA, but a step beyond;
> > in other ways, scale-out is very different. The ramster project [2]
> > in the staging tree is a step in the direction of "clusterizing" RAM,
> > but may or may not be the right step.
>
> I've got to say, from a physics rather than mm perspective, this
> sounds like a really badly framed problem. We seek to eliminate
> complexity by simplification. What this often means is that even
> though the theory allows us to solve a problem in an arbitrary frame,
> there's usually a nice one in which it looks a lot simpler (that's
> what the whole game of eigenvector mathematics and group characters is
> all about).
>
> Saying we need to consider remote in-use memory as high NUMA and manage
> it from a local node looks a lot like saying we need to consider a
> problem in an arbitrary frame rather than looking for the simplest one.
> The fact of the matter is that network remote memory has latency orders
> of magnitude above local ... the effect is so distinct, it's not even
> worth calling it NUMA. It does seem, then, that the correct frame in
> which to consider this is local + remote separately, with hierarchical
> management (the massive difference in latencies makes this a simple
> observation from perturbation theory). Amazingly, this is what current
> clustering tools tend to do, so I don't really see that there's much
> here to add to current practice.
Hi James --
Key point...
> The fact of the matter is that network remote memory has latency orders
> of magnitude above local ... the effect is so distinct, it's not even
> worth calling it NUMA.
I didn't say "network remote memory", though I suppose the underlying
fabric might support TCP/IP traffic as well. If there is a "fast
connection" between the nodes or from nodes to a "memory server", and
it is NOT cache-coherent, and the addressable unit is much larger
than a byte (i.e. perhaps a page), the "frame" is not so arbitrary.
For example "RDMA'ing" a page from one node's RAM to another
node's RAM might not be much slower than copying a page on
a large ccNUMA machine, and still orders of magnitude faster
than paging-in or swapping-in from remote storage. And just
as today's kernel NUMA code attempts to anticipate if/when
data will be needed and copy it from remote NUMA node to local
NUMA node, this "RDMA-ish" technique could do the same between
cooperating kernels on different machines.
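
(To put something concrete behind "RDMA'ing a page": with today's
userspace verbs API, pushing one 4K page into a peer's registered
memory is roughly the function below. All of the rdma_cm connection
setup, memory-registration exchange, and completion polling is elided;
qp, mr, remote_addr and rkey are assumed to come from the usual
handshake. A sketch only, of course, not a proposed in-kernel API.)

#include <infiniband/verbs.h>
#include <stdint.h>

#define PAGE_SZ 4096

/* Sketch: one-sided RDMA write of a single page.  The peer's CPU is
 * not involved in the transfer, which is what makes the "not much
 * slower than a local page copy" hope plausible on a fast fabric. */
int rdma_copy_page_to_peer(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_page,
                           uint64_t remote_addr, uint32_t rkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)local_page,
		.length = PAGE_SZ,
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr = {
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_RDMA_WRITE,
		.send_flags = IBV_SEND_SIGNALED,
	};
	struct ibv_send_wr *bad_wr;

	wr.wr.rdma.remote_addr = remote_addr;
	wr.wr.rdma.rkey        = rkey;

	/* the caller reaps the completion from the send CQ */
	return ibv_post_send(qp, &wr, &bad_wr);
}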
In other words, I'm positing a nice "correct frame" which, given
changes in system topology, fits between today's ccNUMA machines and
JBON (just a bunch of nodes, connected via LAN), and proposing that
maybe the MM subsystem could not only be aware of it but actively
participate in it.
As I said, ramster is one such possibility... I'm wondering if
there are more and, if so, better ones.
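
(For concreteness, the MM-facing half of a ramster-ish backend is
roughly a frontswap driver: pages that would otherwise be swapped out
get offered to the backend first. Below is a minimal sketch, NOT
ramster itself; the remote_rdma_*() calls are hypothetical stand-ins
for a transport like the one sketched above, and the frontswap
signatures shown are approximately those of current 3.x kernels, which
have shifted a bit between versions.)

#include <linux/module.h>
#include <linux/frontswap.h>
#include <linux/swap.h>

/* hypothetical transport entry points, NOT a real API */
int remote_rdma_put_page(unsigned type, pgoff_t offset, struct page *page);
int remote_rdma_get_page(unsigned type, pgoff_t offset, struct page *page);
void remote_rdma_flush_page(unsigned type, pgoff_t offset);
void remote_rdma_flush_area(unsigned type);

static void remote_init(unsigned type)
{
	/* a real backend would set up per-swap-type state here */
}

static int remote_store(unsigned type, pgoff_t offset, struct page *page)
{
	/* nonzero means "not stored": the page then goes to the real
	 * swap device, so a slow or full peer degrades gracefully */
	return remote_rdma_put_page(type, offset, page);
}

static int remote_load(unsigned type, pgoff_t offset, struct page *page)
{
	/* the hope: something like page-copy time, not disk time */
	return remote_rdma_get_page(type, offset, page);
}

static void remote_invalidate_page(unsigned type, pgoff_t offset)
{
	remote_rdma_flush_page(type, offset);
}

static void remote_invalidate_area(unsigned type)
{
	remote_rdma_flush_area(type);
}

static struct frontswap_ops remote_frontswap_ops = {
	.init            = remote_init,
	.store           = remote_store,
	.load            = remote_load,
	.invalidate_page = remote_invalidate_page,
	.invalidate_area = remote_invalidate_area,
};

static int __init remote_ram_init(void)
{
	frontswap_register_ops(&remote_frontswap_ops);
	return 0;
}
module_init(remote_ram_init);
MODULE_LICENSE("GPL");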
Does that make more sense?