From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============8183624083578857084=="
MIME-Version: 1.0
From: Ross Zwisler
Subject: Re: [Devel] [RFC v2 0/5] surface heterogeneous memory performance information
Date: Fri, 07 Jul 2017 10:25:12 -0600
Message-ID: <20170707162512.GA22856@linux.intel.com>
In-Reply-To: 1499408836.23251.3.camel@gmail.com
List-ID:
To: devel@acpica.org

--===============8183624083578857084==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

On Fri, Jul 07, 2017 at 04:27:16PM +1000, Balbir Singh wrote:
> On Thu, 2017-07-06 at 15:52 -0600, Ross Zwisler wrote:
> > ==== Quick Summary ====
> >
> > Platforms in the very near future will have multiple types of memory
> > attached to a single CPU.  These disparate memory ranges will have some
> > characteristics in common, such as CPU cache coherence, but they can have
> > wide ranges of performance both in terms of latency and bandwidth.
> >
> > For example, consider a system that contains persistent memory, standard
> > DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
> > There could potentially be an order of magnitude or more difference in
> > performance between the slowest and fastest memory attached to that CPU.
> >
> > With the current Linux code NUMA nodes are CPU-centric, so all the memory
> > attached to a given CPU will be lumped into the same NUMA node.  This makes
> > it very difficult for userspace applications to understand the performance
> > of different memory ranges on a given CPU.
> >
> > We solve this issue by providing userspace with performance information on
> > individual memory ranges.  This performance information is exposed via
> > sysfs:
> >
> >   # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
> >   mem_tgt2/firmware_id:1
> >   mem_tgt2/is_cached:0
> >   mem_tgt2/is_enabled:1
> >   mem_tgt2/is_isolated:0
>
> Could you please explain these characteristics, are they in the patches
> to follow?

Yea, sorry, these do need more explanation.  These values are derived from the
ACPI SRAT/HMAT tables:

> >   mem_tgt2/firmware_id:1

This is the proximity domain, as defined in the SRAT and HMAT.  Basically
every ACPI proximity domain will end up being a unique NUMA node in Linux, but
the numbers may get reordered and Linux can create extra NUMA nodes that don't
map back to ACPI proximity domains.  So, this value is needed if anyone ever
wants to look at the ACPI HMAT and SRAT tables directly and make sense of how
they map to NUMA nodes in Linux.

> >   mem_tgt2/is_cached:0

The HMAT provides lots of detailed information when a memory region has
caching layers.  For each layer of memory caching it has the ability to
provide latency and bandwidth information for both reads and writes,
information about the caching associativity (direct mapped, something more
complex), the writeback policy (WB, WT), the cache line size, etc.

For simplicity this sysfs interface doesn't expose that level of detail to the
user, and this flag just lets the user know whether the memory region they are
looking at has caching layers or not.  Right now the additional details, if
desired, can be gathered by looking at the raw tables.

> >   mem_tgt2/is_enabled:1

This tells whether the memory region is enabled, as defined by the flags in
the SRAT.  Actually, though, in this version of the patch series we don't
create entries for CPUs or memory regions that aren't enabled, so this isn't
needed.  I'll remove it for v3.
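
To make the listing format concrete, here is a minimal Python sketch (an
illustration only, not code from the patch series) that parses the `grep .`
output quoted above into per-target attribute dictionaries.  The helper name
parse_sysfs_listing and the SAMPLE string are hypothetical; the sample values
are the ones quoted above, and the sketch assumes every attribute value is an
integer in decimal or 0x-prefixed hex.

```python
def parse_sysfs_listing(text):
    """Parse `grep .`-style "path:value" lines into {target: {attr: int}}.

    Hypothetical helper for illustration; assumes integer-valued attributes.
    """
    targets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last ':' so the attribute path stays intact.
        path, _, value = line.rpartition(":")
        # The first path component names the memory target (e.g. mem_tgt2).
        target, _, attr = path.partition("/")
        # Base 0 accepts both decimal ("30720") and hex ("0x800000000").
        targets.setdefault(target, {})[attr] = int(value, 0)
    return targets


# Sample copied from the listing quoted in this thread.
SAMPLE = """\
mem_tgt2/firmware_id:1
mem_tgt2/is_cached:0
mem_tgt2/is_enabled:1
mem_tgt2/is_isolated:0
"""

attrs = parse_sysfs_listing(SAMPLE)["mem_tgt2"]
print("ACPI proximity domain:", attrs["firmware_id"])
print("has caching layers:", bool(attrs["is_cached"]))
```

Keeping firmware_id alongside the other attributes is what would let a tool
cross-reference a target against the raw SRAT/HMAT tables.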

> >   mem_tgt2/is_isolated:0

This surfaces a flag in the HMAT's Memory Subsystem Address Range Structure:

  Bit [2]: Reservation hint - if set to 1, it is recommended
  that the operating system avoid placing allocations in
  this region if it cannot relocate (e.g. OS core memory
  management structures, OS core executable). Any
  allocations placed here should be able to be relocated
  (e.g. disk cache) if the memory is needed for another
  purpose.

Adding kernel support for this hint (i.e. actually reserving the memory region
during boot so it isn't used by the kernel or userspace, and is fully
available for explicit allocation) is part of the future work that we'd do in
follow-on patch series.

> >   mem_tgt2/phys_addr_base:0x0
> >   mem_tgt2/phys_length_bytes:0x800000000
> >   mem_tgt2/local_init/read_bw_MBps:30720
> >   mem_tgt2/local_init/read_lat_nsec:100
> >   mem_tgt2/local_init/write_bw_MBps:30720
> >   mem_tgt2/local_init/write_lat_nsec:100
>
> How do these numbers compare to normal system memory?

These are garbage numbers that I made up in my hacked-up QEMU target. :)

> > This allows applications to easily find the memory that they want to use.
> > We expect that the existing NUMA APIs will be enhanced to use this new
> > information so that applications can continue to use them to select their
> > desired memory.
> >
> > This series is built upon acpica-1705:
> >
> > https://github.com/zetalog/linux/commits/acpica-1705
> >
> > And you can find a working tree here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs
> >
> > ==== Lots of Details ====
> >
> > This patch set is only concerned with CPU-addressable memory types, not
> > on-device memory like what we have with Jerome Glisse's HMM series:
> >
> > https://lwn.net/Articles/726691/
> >
> > This patch set works by enabling the new Heterogeneous Memory Attribute
> > Table (HMAT), newly defined in ACPI 6.2.
> > One major conceptual change
> > in ACPI 6.2 related to this work is that proximity domains no longer need
> > to contain a processor.  We can now have memory-only proximity domains,
> > which means that we can now have memory-only Linux NUMA nodes.
> >
> > Here is an example configuration where we have a single processor, one
> > range of regular memory and one range of HBM:
> >
> >   +---------------+   +----------------+
> >   | Processor     |   | Memory         |
> >   | prox domain 0 +---+ prox domain 1  |
> >   | NUMA node 1   |   | NUMA node 2    |
> >   +-------+-------+   +----------------+
> >           |
> >   +-------+----------+
> >   | HBM              |
> >   | prox domain 2    |
> >   | NUMA node 0      |
> >   +------------------+
> >
> > This gives us one initiator (the processor) and two targets (the two memory
> > ranges).  Each of these three has its own ACPI proximity domain and
> > associated Linux NUMA node.  Note also that while there is a 1:1 mapping
> > from each proximity domain to each NUMA node, the numbers don't necessarily
> > match up.  Additionally we can have extra NUMA nodes that don't map back to
> > ACPI proximity domains.
>
> Could you expand on proximity domains, are they the same as node distance
> or is this ACPI terminology for something more?

I think I answered this above in my explanation of the "firmware_id" field,
but please let me know if you have any more questions.  Basically, a proximity
domain is an ACPI concept that is very similar to a Linux NUMA node, and every
ACPI proximity domain generates and can be mapped to a unique Linux NUMA node.
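
The proximity domain to NUMA node relationship in the example diagram can be
sketched as a small lookup table.  This is only an illustration in Python, not
code from the patch series: PXM_TO_NODE and invert_mapping are hypothetical
names, and the pairs are copied from the diagram quoted above.  The sketch
checks the 1:1 property described in the thread and builds the reverse lookup
that a "firmware_id"-style attribute would provide.

```python
# Proximity domain -> Linux NUMA node, copied from the example diagram.
# The numbers are deliberately not identical on both sides.
PXM_TO_NODE = {
    0: 1,  # processor
    1: 2,  # regular memory
    2: 0,  # HBM
}


def invert_mapping(pxm_to_node):
    """Build NUMA node -> proximity domain, rejecting non-1:1 input."""
    node_to_pxm = {}
    for pxm, node in pxm_to_node.items():
        if node in node_to_pxm:
            raise ValueError(f"NUMA node {node} maps to two proximity domains")
        node_to_pxm[node] = pxm
    return node_to_pxm


NODE_TO_PXM = invert_mapping(PXM_TO_NODE)

# NUMA nodes that Linux creates on its own have no entry, hence .get().
for node in (0, 1, 2, 3):
    print(f"NUMA node {node} -> proximity domain {NODE_TO_PXM.get(node)}")
```

If the mem_tgtN suffix is the NUMA node number (an assumption here, not stated
in this message), the lookup for node 2 agrees with mem_tgt2 reporting
firmware_id:1 in the listing.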
--===============8183624083578857084==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information Date: Fri, 7 Jul 2017 10:25:12 -0600 Message-ID: <20170707162512.GA22856@linux.intel.com> References: <20170706215233.11329-1-ross.zwisler@linux.intel.com> <1499408836.23251.3.camel@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Return-path: Received: from mga05.intel.com ([192.55.52.43]:39396 "EHLO mga05.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750892AbdGGQZP (ORCPT ); Fri, 7 Jul 2017 12:25:15 -0400 Content-Disposition: inline In-Reply-To: <1499408836.23251.3.camel@gmail.com> Sender: linux-acpi-owner@vger.kernel.org List-Id: linux-acpi@vger.kernel.org To: Balbir Singh Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "Anaczkowski, Lukasz" , "Box, David E" , "Kogut, Jaroslaw" , "Lahtinen, Joonas" , "Moore, Robert" , "Nachimuthu, Murugasamy" , "Odzioba, Lukasz" , "Rafael J. Wysocki" , "Rafael J. Wysocki" , "Schmauss, Erik" , "Verma, Vishal L" , "Zheng, Lv" , Andrew Morton , Dan Williams , Dave Hansen , Greg On Fri, Jul 07, 2017 at 04:27:16PM +1000, Balbir Singh wrote: > On Thu, 2017-07-06 at 15:52 -0600, Ross Zwisler wrote: > > ==== Quick Summary ==== > > > > Platforms in the very near future will have multiple types of memory > > attached to a single CPU. These disparate memory ranges will have some > > characteristics in common, such as CPU cache coherence, but they can have > > wide ranges of performance both in terms of latency and bandwidth. > > > > For example, consider a system that contains persistent memory, standard > > DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU. > > There could potentially be an order of magnitude or more difference in > > performance between the slowest and fastest memory attached to that CPU. 
> > > > With the current Linux code NUMA nodes are CPU-centric, so all the memory > > attached to a given CPU will be lumped into the same NUMA node. This makes > > it very difficult for userspace applications to understand the performance > > of different memory ranges on a given CPU. > > > > We solve this issue by providing userspace with performance information on > > individual memory ranges. This performance information is exposed via > > sysfs: > > > > # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null > > mem_tgt2/firmware_id:1 > > mem_tgt2/is_cached:0 > > mem_tgt2/is_enabled:1 > > mem_tgt2/is_isolated:0 > > Could you please explain these charactersitics, are they in the patches > to follow? Yea, sorry, these do need more explanation. These values are derived from the ACPI SRAT/HMAT tables: > > mem_tgt2/firmware_id:1 This is the proximity domain, as defined in the SRAT and HMAT. Basically every ACPI proximity domain will end up being a unique NUMA node in Linux, but the numbers may get reordered and Linux can create extra NUMA nodes that don't map back to ACPI proximity domains. So, this value is needed if anyone ever wants to look at the ACPI HMAT and SRAT tables directly and make sense of how they map to NUMA nodes in Linux. > > mem_tgt2/is_cached:0 The HMAT provides lots of detailed information when a memory region has caching layers. For each layer of memory caching it has the ability to provide latency and bandwidth information for both reads and writes, information about the caching associativity (direct mapped, something more complex), the writeback policy (WB, WT), the cache line size, etc. For simplicity this sysfs interface doesn't expose that level of detail to the user, and this flag just lets the user know whether the memory region they are looking at has caching layers or not. Right now the additional details, if desired, can be gathered by looking at the raw tables. 
> > mem_tgt2/is_enabled:1 Tells whether the memory region is enabled, as defined by the flags in the SRAT. Actually, though, in this version of the patch series we don't create entries for CPUs or memory regions that aren't enabled, so this isn't needed. I'll remove for v3. > > mem_tgt2/is_isolated:0 This surfaces a flag in the HMAT's Memory Subsystem Address Range Structure: Bit [2]: Reservation hint—if set to 1, it is recommended that the operating system avoid placing allocations in this region if it cannot relocate (e.g. OS core memory management structures, OS core executable). Any allocations placed here should be able to be relocated (e.g. disk cache) if the memory is needed for another purpose. Adding kernel support for this hint (i.e. actually reserving the memory region during boot so it isn't used by the kernel or userspace, and is fully available for explicit allocation) is part of the future work that we'd do in follow-on patch series. > > mem_tgt2/phys_addr_base:0x0 > > mem_tgt2/phys_length_bytes:0x800000000 > > mem_tgt2/local_init/read_bw_MBps:30720 > > mem_tgt2/local_init/read_lat_nsec:100 > > mem_tgt2/local_init/write_bw_MBps:30720 > > mem_tgt2/local_init/write_lat_nsec:100 > > How to these numbers compare to normal system memory? These are garbage numbers that I made up in my hacked-up QEMU target. :) > > This allows applications to easily find the memory that they want to use. > > We expect that the existing NUMA APIs will be enhanced to use this new > > information so that applications can continue to use them to select their > > desired memory. 
> > > > This series is built upon acpica-1705: > > > > https://github.com/zetalog/linux/commits/acpica-1705 > > > > And you can find a working tree here: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs > > > > ==== Lots of Details ==== > > > > This patch set is only concerned with CPU-addressable memory types, not > > on-device memory like what we have with Jerome Glisse's HMM series: > > > > https://lwn.net/Articles/726691/ > > > > This patch set works by enabling the new Heterogeneous Memory Attribute > > Table (HMAT) table, newly defined in ACPI 6.2. One major conceptual change > > in ACPI 6.2 related to this work is that proximity domains no longer need > > to contain a processor. We can now have memory-only proximity domains, > > which means that we can now have memory-only Linux NUMA nodes. > > > > Here is an example configuration where we have a single processor, one > > range of regular memory and one range of HBM: > > > > +---------------+ +----------------+ > > | Processor | | Memory | > > | prox domain 0 +---+ prox domain 1 | > > | NUMA node 1 | | NUMA node 2 | > > +-------+-------+ +----------------+ > > | > > +-------+----------+ > > | HBM | > > | prox domain 2 | > > | NUMA node 0 | > > +------------------+ > > > > This gives us one initiator (the processor) and two targets (the two memory > > ranges). Each of these three has its own ACPI proximity domain and > > associated Linux NUMA node. Note also that while there is a 1:1 mapping > > from each proximity domain to each NUMA node, the numbers don't necessarily > > match up. Additionally we can have extra NUMA nodes that don't map back to > > ACPI proximity domains. > > Could you expand on proximity domains, are they the same as node distance > or is this ACPI terminology for something more? I think I answered this above in my explanation of the "firmware_id" field, but please let me know if you have any more questions. 
Basically, a proximity domain is an ACPI concept that is very similar to a Linux NUMA node, and every ACPI proximity domain generates and can be mapped to a unique Linux NUMA node. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id DF9A721CF25AC for ; Fri, 7 Jul 2017 09:23:32 -0700 (PDT) Date: Fri, 7 Jul 2017 10:25:12 -0600 From: Ross Zwisler Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information Message-ID: <20170707162512.GA22856@linux.intel.com> References: <20170706215233.11329-1-ross.zwisler@linux.intel.com> <1499408836.23251.3.camel@gmail.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1499408836.23251.3.camel@gmail.com> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" To: Balbir Singh Cc: "Box, David E" , Dave Hansen , "Zheng, Lv" , linux-nvdimm@lists.01.org, "Rafael J. Wysocki" , Anaczkowski,, Robert, Lukasz, "Erik , Len Brown" , Jerome Glisse , devel@acpica.org, "Kogut, Jaroslaw" , linux-mm@kvack.org, Greg Kroah-Hartman , "Nachimuthu, Murugasamy" , "Rafael J. 
Wysocki" , linux-kernel@vger.kernel.org, "Lahtinen, Joonas" , Andrew Morton , Tim Chen List-ID: T24gRnJpLCBKdWwgMDcsIDIwMTcgYXQgMDQ6Mjc6MTZQTSArMTAwMCwgQmFsYmlyIFNpbmdoIHdy b3RlOgo+IE9uIFRodSwgMjAxNy0wNy0wNiBhdCAxNTo1MiAtMDYwMCwgUm9zcyBad2lzbGVyIHdy b3RlOgo+ID4gPT09PSBRdWljayBTdW1tYXJ5ID09PT0KPiA+IAo+ID4gUGxhdGZvcm1zIGluIHRo ZSB2ZXJ5IG5lYXIgZnV0dXJlIHdpbGwgaGF2ZSBtdWx0aXBsZSB0eXBlcyBvZiBtZW1vcnkKPiA+ IGF0dGFjaGVkIHRvIGEgc2luZ2xlIENQVS4gIFRoZXNlIGRpc3BhcmF0ZSBtZW1vcnkgcmFuZ2Vz IHdpbGwgaGF2ZSBzb21lCj4gPiBjaGFyYWN0ZXJpc3RpY3MgaW4gY29tbW9uLCBzdWNoIGFzIENQ VSBjYWNoZSBjb2hlcmVuY2UsIGJ1dCB0aGV5IGNhbiBoYXZlCj4gPiB3aWRlIHJhbmdlcyBvZiBw ZXJmb3JtYW5jZSBib3RoIGluIHRlcm1zIG9mIGxhdGVuY3kgYW5kIGJhbmR3aWR0aC4KPiA+IAo+ ID4gRm9yIGV4YW1wbGUsIGNvbnNpZGVyIGEgc3lzdGVtIHRoYXQgY29udGFpbnMgcGVyc2lzdGVu dCBtZW1vcnksIHN0YW5kYXJkCj4gPiBERFIgbWVtb3J5IGFuZCBIaWdoIEJhbmR3aWR0aCBNZW1v cnkgKEhCTSksIGFsbCBhdHRhY2hlZCB0byB0aGUgc2FtZSBDUFUuCj4gPiBUaGVyZSBjb3VsZCBw b3RlbnRpYWxseSBiZSBhbiBvcmRlciBvZiBtYWduaXR1ZGUgb3IgbW9yZSBkaWZmZXJlbmNlIGlu Cj4gPiBwZXJmb3JtYW5jZSBiZXR3ZWVuIHRoZSBzbG93ZXN0IGFuZCBmYXN0ZXN0IG1lbW9yeSBh dHRhY2hlZCB0byB0aGF0IENQVS4KPiA+IAo+ID4gV2l0aCB0aGUgY3VycmVudCBMaW51eCBjb2Rl IE5VTUEgbm9kZXMgYXJlIENQVS1jZW50cmljLCBzbyBhbGwgdGhlIG1lbW9yeQo+ID4gYXR0YWNo ZWQgdG8gYSBnaXZlbiBDUFUgd2lsbCBiZSBsdW1wZWQgaW50byB0aGUgc2FtZSBOVU1BIG5vZGUu ICBUaGlzIG1ha2VzCj4gPiBpdCB2ZXJ5IGRpZmZpY3VsdCBmb3IgdXNlcnNwYWNlIGFwcGxpY2F0 aW9ucyB0byB1bmRlcnN0YW5kIHRoZSBwZXJmb3JtYW5jZQo+ID4gb2YgZGlmZmVyZW50IG1lbW9y eSByYW5nZXMgb24gYSBnaXZlbiBDUFUuCj4gPiAKPiA+IFdlIHNvbHZlIHRoaXMgaXNzdWUgYnkg cHJvdmlkaW5nIHVzZXJzcGFjZSB3aXRoIHBlcmZvcm1hbmNlIGluZm9ybWF0aW9uIG9uCj4gPiBp bmRpdmlkdWFsIG1lbW9yeSByYW5nZXMuICBUaGlzIHBlcmZvcm1hbmNlIGluZm9ybWF0aW9uIGlz IGV4cG9zZWQgdmlhCj4gPiBzeXNmczoKPiA+IAo+ID4gICAjIGdyZXAgLiBtZW1fdGd0Mi8qIG1l bV90Z3QyL2xvY2FsX2luaXQvKiAyPi9kZXYvbnVsbAo+ID4gICBtZW1fdGd0Mi9maXJtd2FyZV9p ZDoxCj4gPiAgIG1lbV90Z3QyL2lzX2NhY2hlZDowCj4gPiAgIG1lbV90Z3QyL2lzX2VuYWJsZWQ6 
MQo+ID4gICBtZW1fdGd0Mi9pc19pc29sYXRlZDowCj4gCj4gQ291bGQgeW91IHBsZWFzZSBleHBs YWluIHRoZXNlIGNoYXJhY3RlcnNpdGljcywgYXJlIHRoZXkgaW4gdGhlIHBhdGNoZXMKPiB0byBm b2xsb3c/CgpZZWEsIHNvcnJ5LCB0aGVzZSBkbyBuZWVkIG1vcmUgZXhwbGFuYXRpb24uICBUaGVz ZSB2YWx1ZXMgYXJlIGRlcml2ZWQgZnJvbSB0aGUKQUNQSSBTUkFUL0hNQVQgdGFibGVzOgoKPiA+ ICAgbWVtX3RndDIvZmlybXdhcmVfaWQ6MQoKVGhpcyBpcyB0aGUgcHJveGltaXR5IGRvbWFpbiwg YXMgZGVmaW5lZCBpbiB0aGUgU1JBVCBhbmQgSE1BVC4gIEJhc2ljYWxseQpldmVyeSBBQ1BJIHBy b3hpbWl0eSBkb21haW4gd2lsbCBlbmQgdXAgYmVpbmcgYSB1bmlxdWUgTlVNQSBub2RlIGluIExp bnV4LCBidXQKdGhlIG51bWJlcnMgbWF5IGdldCByZW9yZGVyZWQgYW5kIExpbnV4IGNhbiBjcmVh dGUgZXh0cmEgTlVNQSBub2RlcyB0aGF0IGRvbid0Cm1hcCBiYWNrIHRvIEFDUEkgcHJveGltaXR5 IGRvbWFpbnMuICBTbywgdGhpcyB2YWx1ZSBpcyBuZWVkZWQgaWYgYW55b25lIGV2ZXIKd2FudHMg dG8gbG9vayBhdCB0aGUgQUNQSSBITUFUIGFuZCBTUkFUIHRhYmxlcyBkaXJlY3RseSBhbmQgbWFr ZSBzZW5zZSBvZiBob3cKdGhleSBtYXAgdG8gTlVNQSBub2RlcyBpbiBMaW51eC4KCj4gPiAgIG1l bV90Z3QyL2lzX2NhY2hlZDowCgpUaGUgSE1BVCBwcm92aWRlcyBsb3RzIG9mIGRldGFpbGVkIGlu Zm9ybWF0aW9uIHdoZW4gYSBtZW1vcnkgcmVnaW9uIGhhcwpjYWNoaW5nIGxheWVycy4gIEZvciBl YWNoIGxheWVyIG9mIG1lbW9yeSBjYWNoaW5nIGl0IGhhcyB0aGUgYWJpbGl0eSB0bwpwcm92aWRl IGxhdGVuY3kgYW5kIGJhbmR3aWR0aCBpbmZvcm1hdGlvbiBmb3IgYm90aCByZWFkcyBhbmQgd3Jp dGVzLAppbmZvcm1hdGlvbiBhYm91dCB0aGUgY2FjaGluZyBhc3NvY2lhdGl2aXR5IChkaXJlY3Qg bWFwcGVkLCBzb21ldGhpbmcgbW9yZQpjb21wbGV4KSwgdGhlIHdyaXRlYmFjayBwb2xpY3kgKFdC LCBXVCksIHRoZSBjYWNoZSBsaW5lIHNpemUsIGV0Yy4KCkZvciBzaW1wbGljaXR5IHRoaXMgc3lz ZnMgaW50ZXJmYWNlIGRvZXNuJ3QgZXhwb3NlIHRoYXQgbGV2ZWwgb2YgZGV0YWlsIHRvIHRoZQp1 c2VyLCBhbmQgdGhpcyBmbGFnIGp1c3QgbGV0cyB0aGUgdXNlciBrbm93IHdoZXRoZXIgdGhlIG1l bW9yeSByZWdpb24gdGhleSBhcmUKbG9va2luZyBhdCBoYXMgY2FjaGluZyBsYXllcnMgb3Igbm90 LiAgUmlnaHQgbm93IHRoZSBhZGRpdGlvbmFsIGRldGFpbHMsIGlmCmRlc2lyZWQsIGNhbiBiZSBn YXRoZXJlZCBieSBsb29raW5nIGF0IHRoZSByYXcgdGFibGVzLgoKPiA+ICAgbWVtX3RndDIvaXNf ZW5hYmxlZDoxCgpUZWxscyB3aGV0aGVyIHRoZSBtZW1vcnkgcmVnaW9uIGlzIGVuYWJsZWQsIGFz 
IGRlZmluZWQgYnkgdGhlIGZsYWdzIGluIHRoZQpTUkFULiAgQWN0dWFsbHksIHRob3VnaCwgaW4g dGhpcyB2ZXJzaW9uIG9mIHRoZSBwYXRjaCBzZXJpZXMgd2UgZG9uJ3QgY3JlYXRlCmVudHJpZXMg Zm9yIENQVXMgb3IgbWVtb3J5IHJlZ2lvbnMgdGhhdCBhcmVuJ3QgZW5hYmxlZCwgc28gdGhpcyBp c24ndCBuZWVkZWQuCkknbGwgcmVtb3ZlIGZvciB2My4KCj4gPiAgIG1lbV90Z3QyL2lzX2lzb2xh dGVkOjAKClRoaXMgc3VyZmFjZXMgYSBmbGFnIGluIHRoZSBITUFUJ3MgTWVtb3J5IFN1YnN5c3Rl bSBBZGRyZXNzIFJhbmdlIFN0cnVjdHVyZToKCiAgQml0IFsyXTogUmVzZXJ2YXRpb24gaGludOKA lGlmIHNldCB0byAxLCBpdCBpcyByZWNvbW1lbmRlZAogIHRoYXQgdGhlIG9wZXJhdGluZyBzeXN0 ZW0gYXZvaWQgcGxhY2luZyBhbGxvY2F0aW9ucyBpbgogIHRoaXMgcmVnaW9uIGlmIGl0IGNhbm5v dCByZWxvY2F0ZSAoZS5nLiBPUyBjb3JlIG1lbW9yeQogIG1hbmFnZW1lbnQgc3RydWN0dXJlcywg T1MgY29yZSBleGVjdXRhYmxlKS4gQW55CiAgYWxsb2NhdGlvbnMgcGxhY2VkIGhlcmUgc2hvdWxk IGJlIGFibGUgdG8gYmUgcmVsb2NhdGVkCiAgKGUuZy4gZGlzayBjYWNoZSkgaWYgdGhlIG1lbW9y eSBpcyBuZWVkZWQgZm9yIGFub3RoZXIKICBwdXJwb3NlLgoKQWRkaW5nIGtlcm5lbCBzdXBwb3J0 IGZvciB0aGlzIGhpbnQgKGkuZS4gYWN0dWFsbHkgcmVzZXJ2aW5nIHRoZSBtZW1vcnkgcmVnaW9u CmR1cmluZyBib290IHNvIGl0IGlzbid0IHVzZWQgYnkgdGhlIGtlcm5lbCBvciB1c2Vyc3BhY2Us IGFuZCBpcyBmdWxseQphdmFpbGFibGUgZm9yIGV4cGxpY2l0IGFsbG9jYXRpb24pIGlzIHBhcnQg b2YgdGhlIGZ1dHVyZSB3b3JrIHRoYXQgd2UnZCBkbyBpbgpmb2xsb3ctb24gcGF0Y2ggc2VyaWVz LgoKPiA+ICAgbWVtX3RndDIvcGh5c19hZGRyX2Jhc2U6MHgwCj4gPiAgIG1lbV90Z3QyL3BoeXNf bGVuZ3RoX2J5dGVzOjB4ODAwMDAwMDAwCj4gPiAgIG1lbV90Z3QyL2xvY2FsX2luaXQvcmVhZF9i d19NQnBzOjMwNzIwCj4gPiAgIG1lbV90Z3QyL2xvY2FsX2luaXQvcmVhZF9sYXRfbnNlYzoxMDAK PiA+ICAgbWVtX3RndDIvbG9jYWxfaW5pdC93cml0ZV9id19NQnBzOjMwNzIwCj4gPiAgIG1lbV90 Z3QyL2xvY2FsX2luaXQvd3JpdGVfbGF0X25zZWM6MTAwCj4gCj4gSG93IHRvIHRoZXNlIG51bWJl cnMgY29tcGFyZSB0byBub3JtYWwgc3lzdGVtIG1lbW9yeT8KClRoZXNlIGFyZSBnYXJiYWdlIG51 bWJlcnMgdGhhdCBJIG1hZGUgdXAgaW4gbXkgaGFja2VkLXVwIFFFTVUgdGFyZ2V0LiA6KSAgCgo+ ID4gVGhpcyBhbGxvd3MgYXBwbGljYXRpb25zIHRvIGVhc2lseSBmaW5kIHRoZSBtZW1vcnkgdGhh dCB0aGV5IHdhbnQgdG8gdXNlLgo+ID4gV2UgZXhwZWN0IHRoYXQgdGhlIGV4aXN0aW5nIE5VTUEg 
QVBJcyB3aWxsIGJlIGVuaGFuY2VkIHRvIHVzZSB0aGlzIG5ldwo+ID4gaW5mb3JtYXRpb24gc28g dGhhdCBhcHBsaWNhdGlvbnMgY2FuIGNvbnRpbnVlIHRvIHVzZSB0aGVtIHRvIHNlbGVjdCB0aGVp cgo+ID4gZGVzaXJlZCBtZW1vcnkuCj4gPiAKPiA+IFRoaXMgc2VyaWVzIGlzIGJ1aWx0IHVwb24g YWNwaWNhLTE3MDU6Cj4gPiAKPiA+IGh0dHBzOi8vZ2l0aHViLmNvbS96ZXRhbG9nL2xpbnV4L2Nv bW1pdHMvYWNwaWNhLTE3MDUKPiA+IAo+ID4gQW5kIHlvdSBjYW4gZmluZCBhIHdvcmtpbmcgdHJl ZSBoZXJlOgo+ID4gCj4gPiBodHRwczovL2dpdC5rZXJuZWwub3JnL3B1Yi9zY20vbGludXgva2Vy bmVsL2dpdC96d2lzbGVyL2xpbnV4LmdpdC9sb2cvP2g9aG1lbV9zeXNmcwo+ID4gCj4gPiA9PT09 IExvdHMgb2YgRGV0YWlscyA9PT09Cj4gPiAKPiA+IFRoaXMgcGF0Y2ggc2V0IGlzIG9ubHkgY29u Y2VybmVkIHdpdGggQ1BVLWFkZHJlc3NhYmxlIG1lbW9yeSB0eXBlcywgbm90Cj4gPiBvbi1kZXZp Y2UgbWVtb3J5IGxpa2Ugd2hhdCB3ZSBoYXZlIHdpdGggSmVyb21lIEdsaXNzZSdzIEhNTSBzZXJp ZXM6Cj4gPiAKPiA+IGh0dHBzOi8vbHduLm5ldC9BcnRpY2xlcy83MjY2OTEvCj4gPiAKPiA+IFRo aXMgcGF0Y2ggc2V0IHdvcmtzIGJ5IGVuYWJsaW5nIHRoZSBuZXcgSGV0ZXJvZ2VuZW91cyBNZW1v cnkgQXR0cmlidXRlCj4gPiBUYWJsZSAoSE1BVCkgdGFibGUsIG5ld2x5IGRlZmluZWQgaW4gQUNQ SSA2LjIuIE9uZSBtYWpvciBjb25jZXB0dWFsIGNoYW5nZQo+ID4gaW4gQUNQSSA2LjIgcmVsYXRl ZCB0byB0aGlzIHdvcmsgaXMgdGhhdCBwcm94aW1pdHkgZG9tYWlucyBubyBsb25nZXIgbmVlZAo+ ID4gdG8gY29udGFpbiBhIHByb2Nlc3Nvci4gIFdlIGNhbiBub3cgaGF2ZSBtZW1vcnktb25seSBw cm94aW1pdHkgZG9tYWlucywKPiA+IHdoaWNoIG1lYW5zIHRoYXQgd2UgY2FuIG5vdyBoYXZlIG1l bW9yeS1vbmx5IExpbnV4IE5VTUEgbm9kZXMuCj4gPiAKPiA+IEhlcmUgaXMgYW4gZXhhbXBsZSBj b25maWd1cmF0aW9uIHdoZXJlIHdlIGhhdmUgYSBzaW5nbGUgcHJvY2Vzc29yLCBvbmUKPiA+IHJh bmdlIG9mIHJlZ3VsYXIgbWVtb3J5IGFuZCBvbmUgcmFuZ2Ugb2YgSEJNOgo+ID4gCj4gPiAgICst LS0tLS0tLS0tLS0tLS0rICAgKy0tLS0tLS0tLS0tLS0tLS0rCj4gPiAgIHwgUHJvY2Vzc29yICAg ICB8ICAgfCBNZW1vcnkgICAgICAgICB8Cj4gPiAgIHwgcHJveCBkb21haW4gMCArLS0tKyBwcm94 IGRvbWFpbiAxICB8Cj4gPiAgIHwgTlVNQSBub2RlIDEgICB8ICAgfCBOVU1BIG5vZGUgMiAgICB8 Cj4gPiAgICstLS0tLS0tKy0tLS0tLS0rICAgKy0tLS0tLS0tLS0tLS0tLS0rCj4gPiAgICAgICAg ICAgfAo+ID4gICArLS0tLS0tLSstLS0tLS0tLS0tKwo+ID4gICB8IEhCTSAgICAgICAgICAgICAg 
fAo+ID4gICB8IHByb3ggZG9tYWluIDIgICAgfAo+ID4gICB8IE5VTUEgbm9kZSAwICAgICAgfAo+ ID4gICArLS0tLS0tLS0tLS0tLS0tLS0tKwo+ID4gCj4gPiBUaGlzIGdpdmVzIHVzIG9uZSBpbml0 aWF0b3IgKHRoZSBwcm9jZXNzb3IpIGFuZCB0d28gdGFyZ2V0cyAodGhlIHR3byBtZW1vcnkKPiA+ IHJhbmdlcykuICBFYWNoIG9mIHRoZXNlIHRocmVlIGhhcyBpdHMgb3duIEFDUEkgcHJveGltaXR5 IGRvbWFpbiBhbmQKPiA+IGFzc29jaWF0ZWQgTGludXggTlVNQSBub2RlLiAgTm90ZSBhbHNvIHRo YXQgd2hpbGUgdGhlcmUgaXMgYSAxOjEgbWFwcGluZwo+ID4gZnJvbSBlYWNoIHByb3hpbWl0eSBk b21haW4gdG8gZWFjaCBOVU1BIG5vZGUsIHRoZSBudW1iZXJzIGRvbid0IG5lY2Vzc2FyaWx5Cj4g PiBtYXRjaCB1cC4gIEFkZGl0aW9uYWxseSB3ZSBjYW4gaGF2ZSBleHRyYSBOVU1BIG5vZGVzIHRo YXQgZG9uJ3QgbWFwIGJhY2sgdG8KPiA+IEFDUEkgcHJveGltaXR5IGRvbWFpbnMuCj4gCj4gQ291 bGQgeW91IGV4cGFuZCBvbiBwcm94aW1pdHkgZG9tYWlucywgYXJlIHRoZXkgdGhlIHNhbWUgYXMg bm9kZSBkaXN0YW5jZQo+IG9yIGlzIHRoaXMgQUNQSSB0ZXJtaW5vbG9neSBmb3Igc29tZXRoaW5n IG1vcmU/CgpJIHRoaW5rIEkgYW5zd2VyZWQgdGhpcyBhYm92ZSBpbiBteSBleHBsYW5hdGlvbiBv ZiB0aGUgImZpcm13YXJlX2lkIiBmaWVsZCwKYnV0IHBsZWFzZSBsZXQgbWUga25vdyBpZiB5b3Ug aGF2ZSBhbnkgbW9yZSBxdWVzdGlvbnMuICBCYXNpY2FsbHksIGEgcHJveGltaXR5CmRvbWFpbiBp cyBhbiBBQ1BJIGNvbmNlcHQgdGhhdCBpcyB2ZXJ5IHNpbWlsYXIgdG8gYSBMaW51eCBOVU1BIG5v ZGUsIGFuZCBldmVyeQpBQ1BJIHByb3hpbWl0eSBkb21haW4gZ2VuZXJhdGVzIGFuZCBjYW4gYmUg bWFwcGVkIHRvIGEgdW5pcXVlIExpbnV4IE5VTUEgbm9kZS4KX19fX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX19fX19fX19fX18KTGludXgtbnZkaW1tIG1haWxpbmcgbGlzdApMaW51 eC1udmRpbW1AbGlzdHMuMDEub3JnCmh0dHBzOi8vbGlzdHMuMDEub3JnL21haWxtYW4vbGlzdGlu Zm8vbGludXgtbnZkaW1tCg== From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 0FD4D6B02F3 for ; Fri, 7 Jul 2017 12:25:16 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id u5so38612001pgq.14 for ; Fri, 07 Jul 2017 09:25:16 -0700 (PDT) Received: from mga04.intel.com (mga04.intel.com. 
[192.55.52.120]) by mx.google.com with ESMTPS id n12si2484166pgr.349.2017.07.07.09.25.14 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 07 Jul 2017 09:25:15 -0700 (PDT) Date: Fri, 7 Jul 2017 10:25:12 -0600 From: Ross Zwisler Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information Message-ID: <20170707162512.GA22856@linux.intel.com> References: <20170706215233.11329-1-ross.zwisler@linux.intel.com> <1499408836.23251.3.camel@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1499408836.23251.3.camel@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Balbir Singh Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "Anaczkowski, Lukasz" , "Box, David E" , "Kogut, Jaroslaw" , "Lahtinen, Joonas" , "Moore, Robert" , "Nachimuthu, Murugasamy" , "Odzioba, Lukasz" , "Rafael J. Wysocki" , "Rafael J. Wysocki" , "Schmauss, Erik" , "Verma, Vishal L" , "Zheng, Lv" , Andrew Morton , Dan Williams , Dave Hansen , Greg Kroah-Hartman , Jerome Glisse , Len Brown , Tim Chen , devel@acpica.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org On Fri, Jul 07, 2017 at 04:27:16PM +1000, Balbir Singh wrote: > On Thu, 2017-07-06 at 15:52 -0600, Ross Zwisler wrote: > > ==== Quick Summary ==== > > > > Platforms in the very near future will have multiple types of memory > > attached to a single CPU. These disparate memory ranges will have some > > characteristics in common, such as CPU cache coherence, but they can have > > wide ranges of performance both in terms of latency and bandwidth. > > > > For example, consider a system that contains persistent memory, standard > > DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU. > > There could potentially be an order of magnitude or more difference in > > performance between the slowest and fastest memory attached to that CPU. 
> > > > With the current Linux code NUMA nodes are CPU-centric, so all the memory > > attached to a given CPU will be lumped into the same NUMA node. This makes > > it very difficult for userspace applications to understand the performance > > of different memory ranges on a given CPU. > > > > We solve this issue by providing userspace with performance information on > > individual memory ranges. This performance information is exposed via > > sysfs: > > > > # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null > > mem_tgt2/firmware_id:1 > > mem_tgt2/is_cached:0 > > mem_tgt2/is_enabled:1 > > mem_tgt2/is_isolated:0 > > Could you please explain these charactersitics, are they in the patches > to follow? Yea, sorry, these do need more explanation. These values are derived from the ACPI SRAT/HMAT tables: > > mem_tgt2/firmware_id:1 This is the proximity domain, as defined in the SRAT and HMAT. Basically every ACPI proximity domain will end up being a unique NUMA node in Linux, but the numbers may get reordered and Linux can create extra NUMA nodes that don't map back to ACPI proximity domains. So, this value is needed if anyone ever wants to look at the ACPI HMAT and SRAT tables directly and make sense of how they map to NUMA nodes in Linux. > > mem_tgt2/is_cached:0 The HMAT provides lots of detailed information when a memory region has caching layers. For each layer of memory caching it has the ability to provide latency and bandwidth information for both reads and writes, information about the caching associativity (direct mapped, something more complex), the writeback policy (WB, WT), the cache line size, etc. For simplicity this sysfs interface doesn't expose that level of detail to the user, and this flag just lets the user know whether the memory region they are looking at has caching layers or not. Right now the additional details, if desired, can be gathered by looking at the raw tables. 
> > mem_tgt2/is_enabled:1 Tells whether the memory region is enabled, as defined by the flags in the SRAT. Actually, though, in this version of the patch series we don't create entries for CPUs or memory regions that aren't enabled, so this isn't needed. I'll remove for v3. > > mem_tgt2/is_isolated:0 This surfaces a flag in the HMAT's Memory Subsystem Address Range Structure: Bit [2]: Reservation hinta??if set to 1, it is recommended that the operating system avoid placing allocations in this region if it cannot relocate (e.g. OS core memory management structures, OS core executable). Any allocations placed here should be able to be relocated (e.g. disk cache) if the memory is needed for another purpose. Adding kernel support for this hint (i.e. actually reserving the memory region during boot so it isn't used by the kernel or userspace, and is fully available for explicit allocation) is part of the future work that we'd do in follow-on patch series. > > mem_tgt2/phys_addr_base:0x0 > > mem_tgt2/phys_length_bytes:0x800000000 > > mem_tgt2/local_init/read_bw_MBps:30720 > > mem_tgt2/local_init/read_lat_nsec:100 > > mem_tgt2/local_init/write_bw_MBps:30720 > > mem_tgt2/local_init/write_lat_nsec:100 > > How to these numbers compare to normal system memory? These are garbage numbers that I made up in my hacked-up QEMU target. :) > > This allows applications to easily find the memory that they want to use. > > We expect that the existing NUMA APIs will be enhanced to use this new > > information so that applications can continue to use them to select their > > desired memory. 
> > > > This series is built upon acpica-1705: > > > > https://github.com/zetalog/linux/commits/acpica-1705 > > > > And you can find a working tree here: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs > > > > ==== Lots of Details ==== > > > > This patch set is only concerned with CPU-addressable memory types, not > > on-device memory like what we have with Jerome Glisse's HMM series: > > > > https://lwn.net/Articles/726691/ > > > > This patch set works by enabling the new Heterogeneous Memory Attribute > > Table (HMAT) table, newly defined in ACPI 6.2. One major conceptual change > > in ACPI 6.2 related to this work is that proximity domains no longer need > > to contain a processor. We can now have memory-only proximity domains, > > which means that we can now have memory-only Linux NUMA nodes. > > > > Here is an example configuration where we have a single processor, one > > range of regular memory and one range of HBM: > > > > +---------------+ +----------------+ > > | Processor | | Memory | > > | prox domain 0 +---+ prox domain 1 | > > | NUMA node 1 | | NUMA node 2 | > > +-------+-------+ +----------------+ > > | > > +-------+----------+ > > | HBM | > > | prox domain 2 | > > | NUMA node 0 | > > +------------------+ > > > > This gives us one initiator (the processor) and two targets (the two memory > > ranges). Each of these three has its own ACPI proximity domain and > > associated Linux NUMA node. Note also that while there is a 1:1 mapping > > from each proximity domain to each NUMA node, the numbers don't necessarily > > match up. Additionally we can have extra NUMA nodes that don't map back to > > ACPI proximity domains. > > Could you expand on proximity domains, are they the same as node distance > or is this ACPI terminology for something more? I think I answered this above in my explanation of the "firmware_id" field, but please let me know if you have any more questions. 
Basically, a proximity domain is an ACPI concept that is very similar to a
Linux NUMA node, and every ACPI proximity domain generates and can be mapped
to a unique Linux NUMA node.
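
To make the relationship between proximity domains and NUMA nodes concrete,
the following Python sketch models the example configuration from the diagram
(proximity domains 0, 1, 2 mapped to NUMA nodes 1, 2, 0). It only illustrates
the 1:1 but renumbered mapping and the possibility of extra nodes with no
proximity domain. The names PXM_TO_NODE, invert() and firmware_id_for_node()
are hypothetical and do not describe how the kernel stores this information.

```python
# ACPI proximity domain -> Linux NUMA node, as drawn in the example diagram.
PXM_TO_NODE = {0: 1, 1: 2, 2: 0}


def invert(pxm_to_node):
    """Build node -> proximity domain, rejecting a mapping that is not 1:1."""
    node_to_pxm = {}
    for pxm, node in pxm_to_node.items():
        if node in node_to_pxm:
            raise ValueError(f"node {node} maps to more than one proximity domain")
        node_to_pxm[node] = pxm
    return node_to_pxm


NODE_TO_PXM = invert(PXM_TO_NODE)


def firmware_id_for_node(node):
    """Return the proximity domain behind a node, or None if there is none.

    None models an extra NUMA node that does not map back to any ACPI
    proximity domain.
    """
    return NODE_TO_PXM.get(node)


print(firmware_id_for_node(2))  # the "Memory" box: NUMA node 2, prox domain 1
print(firmware_id_for_node(3))  # hypothetical extra node with no prox domain
```

In the diagram, NUMA node 2 sits on proximity domain 1, which appears
consistent with the mem_tgt2/firmware_id:1 value in the quoted sysfs output.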