* [LSF/MM TOPIC] Beyond NUMA
@ 2013-04-12  0:29 Dan Magenheimer
  2013-04-14 21:39 ` James Bottomley
  2013-04-15  0:39 ` Ric Mason

From: Dan Magenheimer @ 2013-04-12  0:29 UTC (permalink / raw)
To: lsf; +Cc: linux-mm

MM developers and all --

It's a bit late to add a topic, but with such a great group of brains
together, it seems worthwhile to spend at least some time speculating
on "farther-out" problems.  So I propose for the MM track:

	Beyond NUMA

NUMA now impacts even the smallest servers and soon, perhaps, even
embedded systems, but the performance effects are limited when the
number of nodes is small (e.g. two).  As the number of nodes grows,
along with the number of memory controllers, NUMA can have a big
performance impact, and the MM community has invested a huge amount of
energy in reducing this problem.

But as the number of memory controllers grows, the cost of the system
grows faster.  This is classic "scale-up", and certain workloads will
always benefit from having as many CPUs/cores and nodes as can be
packed into a single system.  System vendors are happy to oblige
because the profit margin on scale-up systems can be proportionally
much, much larger than on smaller commodity systems.  So the NUMA work
will always be necessary and important.

But as scale-up grows to previously unimaginable levels, an increasing
fraction of workloads cannot benefit enough to compensate for the
non-linear increase in system cost.  And so more users, especially
cost-sensitive ones, are turning instead to scale-out to optimize cost
vs. benefit in their massive data centers.  Recent examples include
HP's Moonshot and Facebook's "Group Hug".  And even major data center
topology changes are being proposed which use super-high-speed links
to separate CPUs from RAM [1].

While filesystems and storage long ago adapted to handle large numbers
of servers effectively, the MM subsystem is still isolated, managing
its own private set of RAM, independent of and completely partitioned
from the RAM of other servers.  Perhaps we, the Linux MM developers,
should start considering how MM can evolve in this new world.  In some
ways, scale-out is like NUMA, but a step beyond.  In other ways,
scale-out is very different.  The ramster project [2] in the staging
tree is a step in the direction of "clusterizing" RAM, but may or may
not be the right step.

Discuss.

[1] http://allthingsd.com/20130410/intel-wants-to-redesign-your-server-rack/
[2] http://lwn.net/Articles/481681/

(see y'all next week!)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: [LSF/MM TOPIC] Beyond NUMA

From: James Bottomley @ 2013-04-14 21:39 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: lsf, linux-mm

On Thu, 2013-04-11 at 17:29 -0700, Dan Magenheimer wrote:
> MM developers and all --
>
> [...]
>
> While filesystems and storage have long ago adapted to handle large
> numbers of servers effectively, the MM subsystem is still isolated,
> managing its own private set of RAM, independent of and completely
> partitioned from the RAM of other servers.  Perhaps we, the Linux
> MM developers, should start considering how MM can evolve in this
> new world.  In some ways, scale-out is like NUMA, but a step beyond.
> In other ways, scale-out is very different.  The ramster project [2]
> in the staging tree is a step in the direction of "clusterizing" RAM,
> but may or may not be the right step.

I've got to say, from a physics rather than an mm perspective, this
sounds like a really badly framed problem.  We seek to eliminate
complexity by simplification.  What this often means is that even
though the theory allows us to solve a problem in an arbitrary frame,
there's usually a nice one where it looks a lot simpler (that's what
the whole game of eigenvector mathematics and group characters is all
about).

Saying we need to consider remote in-use memory as high-NUMA and
manage it from a local node looks a lot like saying we need to
consider a problem in an arbitrary frame rather than looking for the
simplest one.  The fact of the matter is that network remote memory
has latency orders of magnitude above local ... the effect is so
distinct, it's not even worth calling it NUMA.  It does seem, then,
that the correct frame to consider this in is local + remote
separately, with hierarchical management (the massive difference in
latencies makes this a simple observation from perturbation theory).
Amazingly, this is what current clustering tools tend to do, so I
don't really see there's much here to add to the current practice.

James
* Re: [Lsf] [LSF/MM TOPIC] Beyond NUMA

From: Dave Chinner @ 2013-04-14 23:49 UTC (permalink / raw)
To: James Bottomley; +Cc: Dan Magenheimer, lsf, linux-mm

On Sun, Apr 14, 2013 at 02:39:50PM -0700, James Bottomley wrote:
> On Thu, 2013-04-11 at 17:29 -0700, Dan Magenheimer wrote:
> > [...]
>
> [...] It does seem then that the correct frame to
> consider this in is local + remote separately with a hierarchical
> management (the massive difference in latencies makes this a simple
> observation from perturbation theory).  Amazingly this is what
> current clustering tools tend to do, so I don't really see there's
> much here to add to the current practice.

Everyone who wants to talk about this topic should google "vNUMA" and
read the research papers from a few years ago.  They give pretty good
insight into the practicality of treating the RAM in a cluster as a
single virtual NUMA machine with a large distance factor.

And then there's the crazy guys that have been trying to implement DLM
(distributed large memory) using kernel-based MPI communication for
cache coherency protocols at page fault level....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [Lsf] [LSF/MM TOPIC] Beyond NUMA

From: James Bottomley @ 2013-04-15  1:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: lsf, linux-mm, Dan Magenheimer

On Mon, 2013-04-15 at 09:49 +1000, Dave Chinner wrote:
> > [...] Amazingly this is what current clustering tools tend to do,
> > so I don't really see there's much here to add to the current
> > practice.
>
> Everyone who wants to talk about this topic should google "vNUMA"
> and read the research papers from a few years ago.  It gives pretty
> good insight in the practicality of treating the RAM in a cluster as
> a single virtual NUMA machine with a large distance factor.

Um, yes, insert comment about crazy Australians.  vNUMA was doomed to
failure from the beginning, I think, because they tried to maintain
coherency across the systems.  The paper contains a nicely understated
expression of disappointment that the resulting system was so slow.
I'm sure, as an ex-SGI person, you'd agree with me that high NUMA
across a network is possible ... but only with a boatload of hardware
acceleration like the Altix had.

> And then there's the crazy guys that have been trying to implement
> DLM (distributed large memory) using kernel based MPI communication
> for cache coherency protocols at page fault level....

I have to confess to being one of those crazy people way back when I
was at Bell Labs in the 90s ... it was mostly a curiosity until it
found a use in distributed databases.  But the question still stands:
the current vogue for clustering is locally managed resources coupled
to a resource hierarchy, to try to get away from the entanglement
factors that caused the problems vNUMA saw.  What I don't get from
this topic is what it will add to the current state of the art ...
or, more truthfully, what I get is that it seems to be advocating
going backwards.

James
* RE: [Lsf] [LSF/MM TOPIC] Beyond NUMA

From: Dan Magenheimer @ 2013-04-15 20:47 UTC (permalink / raw)
To: James Bottomley, Dave Chinner; +Cc: lsf, linux-mm

> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Subject: Re: [Lsf] [LSF/MM TOPIC] Beyond NUMA
>
> what I don't get from this topic is what it will add to the
> current state of the art or more truthfully what I get is it seems
> to be advocating going backwards ...
>
> James

Heh, I think this industry is more of a pendulum than a directed
vector, so today's backwards may be tomorrow's great leap forward.
* Re: [Lsf] [LSF/MM TOPIC] Beyond NUMA

From: H. Peter Anvin @ 2013-04-15 20:50 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: James Bottomley, Dave Chinner, lsf, linux-mm

On 04/15/2013 01:47 PM, Dan Magenheimer wrote:
>
> Heh, I think this industry is more of a pendulum than
> a directed vector, so today's backwards may be tomorrow's
> great leap forward.
>

Well, it is more that the direction of the industry is affected by
external and technological factors that are continually in flux.  As
such, the attractor point tends to shift long before it is reached.

	-hpa
* RE: [Lsf] [LSF/MM TOPIC] Beyond NUMA

From: Dan Magenheimer @ 2013-04-15 15:28 UTC (permalink / raw)
To: James Bottomley; +Cc: lsf, linux-mm

> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Subject: Re: [Lsf] [LSF/MM TOPIC] Beyond NUMA
>
> [...]

Hi James --

Key point...

> The fact of the matter is that network remote memory has latency
> orders of magnitude above local ... the effect is so distinct, it's
> not even worth calling it NUMA.

I didn't say "network remote memory", though I suppose the underlying
fabric might support TCP/IP traffic as well.  If there is a "fast
connection" between the nodes, or from nodes to a "memory server", and
it is NOT cache-coherent, and the addressable unit is much larger than
a byte (i.e. perhaps a page), the "frame" is not so arbitrary.

For example, "RDMA'ing" a page from one node's RAM to another node's
RAM might not be much slower than copying a page on a large ccNUMA
machine, and still orders of magnitude faster than paging-in or
swapping-in from remote storage.  And just as today's kernel NUMA code
attempts to anticipate if/when data will be needed and copies it from
a remote NUMA node to the local NUMA node, this "RDMA-ish" technique
could do the same between cooperating kernels on different machines.

In other words, I'm positing a nice "correct frame" which, given
changes in system topology, fits between current ccNUMA machines and
JBON (just a bunch of nodes, connected via LAN), and proposing that
maybe the MM subsystem could be not only aware of it but actively
participate in it.  As I said, ramster is one such possibility...
I'm wondering if there are more and, if so, better ones.

Does that make more sense?
* Re: [LSF/MM TOPIC] Beyond NUMA

From: Ric Mason @ 2013-04-15  0:39 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: lsf, linux-mm

Hi Dan,

On 04/12/2013 08:29 AM, Dan Magenheimer wrote:
> MM developers and all --
>
> [...]
>
> While filesystems and storage have long ago adapted to handle large
> numbers of servers effectively, the MM subsystem is still isolated,
> managing its own private set of RAM, independent of and completely
> partitioned from the RAM of other servers.  Perhaps we, the Linux
> MM developers, should start considering how MM can evolve in this
> new world.  In some ways, scale-out is like NUMA, but a step beyond.
> In other ways, scale-out is very different.  The ramster project [2]
> in the staging tree is a step in the direction of "clusterizing" RAM,
> but may or may not be the right step.

If I configure a UMA machine as fake NUMA, is there a benefit, or does
it impact performance?
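For reference, x86 kernels built with NUMA emulation support can split
a UMA machine into fake nodes at boot via the `numa=fake=` parameter;
the node count below is arbitrary, and whether it helps or hurts a
given workload is exactly the open question:

```shell
# NUMA emulation on x86 (requires CONFIG_NUMA_EMU): split a UMA
# machine's RAM into 4 equal fake nodes by adding to the kernel
# command line:
#
#     numa=fake=4
#
# After reboot, the emulated topology shows up through the usual
# interfaces, so NUMA-aware policies can be exercised on UMA hardware:
numactl --hardware
ls /sys/devices/system/node/
```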