From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Balbir Singh <bsingharora@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org, "Anaczkowski,
	Lukasz" <lukasz.anaczkowski@intel.com>,
	"Box, David E" <david.e.box@intel.com>,
	"Kogut, Jaroslaw" <Jaroslaw.Kogut@intel.com>,
	"Lahtinen, Joonas" <joonas.lahtinen@intel.com>,
	"Moore, Robert" <robert.moore@intel.com>,
	"Nachimuthu, Murugasamy" <murugasamy.nachimuthu@intel.com>,
	"Odzioba, Lukasz" <lukasz.odzioba@intel.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	"Schmauss, Erik" <erik.schmauss@intel.com>,
	"Verma, Vishal L" <vishal.l.verma@intel.com>,
	"Zheng, Lv" <lv.zheng@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Jerome Glisse <jglisse@redhat.com>, Len Brown <lenb@kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	devel@acpica.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-nvdimm@lists.01.org
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information
Date: Fri, 7 Jul 2017 10:25:12 -0600
Message-ID: <20170707162512.GA22856@linux.intel.com>
In-Reply-To: <1499408836.23251.3.camel@gmail.com>

On Fri, Jul 07, 2017 at 04:27:16PM +1000, Balbir Singh wrote:
> On Thu, 2017-07-06 at 15:52 -0600, Ross Zwisler wrote:
> > ==== Quick Summary ====
> > 
> > Platforms in the very near future will have multiple types of memory
> > attached to a single CPU.  These disparate memory ranges will have some
> > characteristics in common, such as CPU cache coherence, but they can have
> > wide ranges of performance both in terms of latency and bandwidth.
> > 
> > For example, consider a system that contains persistent memory, standard
> > DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
> > There could potentially be an order of magnitude or more difference in
> > performance between the slowest and fastest memory attached to that CPU.
> > 
> > With the current Linux code, NUMA nodes are CPU-centric, so all the memory
> > attached to a given CPU will be lumped into the same NUMA node.  This makes
> > it very difficult for userspace applications to understand the performance
> > of the different memory ranges attached to a given CPU.
> > 
> > We solve this issue by providing userspace with performance information on
> > individual memory ranges.  This performance information is exposed via
> > sysfs:
> > 
> >   # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
> >   mem_tgt2/firmware_id:1
> >   mem_tgt2/is_cached:0
> >   mem_tgt2/is_enabled:1
> >   mem_tgt2/is_isolated:0
> 
> Could you please explain these characteristics? Are they in the patches
> to follow?

Yeah, sorry, these do need more explanation.  These values are derived from
the ACPI SRAT/HMAT tables:

> >   mem_tgt2/firmware_id:1

This is the proximity domain, as defined in the SRAT and HMAT.  Basically
every ACPI proximity domain will end up being a unique NUMA node in Linux, but
the numbers may get reordered and Linux can create extra NUMA nodes that don't
map back to ACPI proximity domains.  So, this value is needed if anyone ever
wants to look at the ACPI HMAT and SRAT tables directly and make sense of how
they map to NUMA nodes in Linux.
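
As an example, a userspace tool that wants to cross-reference one of these
targets against a raw SRAT/HMAT dump could read this value directly.  Here's
a minimal sketch (the /sys/devices/system/hmem path is my assumption about
where the mem_tgtN directories land; check the patches for the real
location):

  #include <stdio.h>

  int main(void)
  {
          FILE *f;
          int pxm = -1;

          /* firmware_id holds the ACPI proximity domain number */
          f = fopen("/sys/devices/system/hmem/mem_tgt2/firmware_id", "r");
          if (f) {
                  if (fscanf(f, "%d", &pxm) != 1)
                          pxm = -1;
                  fclose(f);
          }
          if (pxm < 0)
                  return 1;

          printf("mem_tgt2 is ACPI proximity domain %d\n", pxm);
          return 0;
  }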

> >   mem_tgt2/is_cached:0

The HMAT provides lots of detailed information when a memory region has
caching layers.  For each layer of memory caching it can provide latency
and bandwidth information for both reads and writes, information about the
caching associativity (direct mapped or something more complex), the
writeback policy (WB or WT), the cache line size, etc.

For simplicity, this sysfs interface doesn't expose that level of detail to
the user; this flag just lets the user know whether the memory region they
are looking at has caching layers or not.  For now the additional details,
if desired, can be gathered by looking at the raw tables.
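
To give a feel for what's being collapsed into this one flag, here's an
illustrative sketch (the field names are mine, not ACPICA's actual
definitions) of roughly what each HMAT Memory Side Cache Information
Structure describes:

  #include <stdint.h>

  /* Illustrative only -- one entry per caching level of a memory range.
   * Per-level read/write latency and bandwidth are described separately
   * in the HMAT. */
  struct hmat_side_cache_info {
          uint32_t memory_proximity_domain; /* which memory range is cached */
          uint64_t cache_size;              /* size of this caching level */
          uint8_t  cache_level;             /* level 1 is nearest the CPU */
          uint8_t  associativity;           /* direct mapped or more complex */
          uint8_t  write_policy;            /* WB or WT */
          uint16_t cache_line_size;         /* in bytes */
  };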

> >   mem_tgt2/is_enabled:1

This tells whether the memory region is enabled, as defined by the flags in
the SRAT.  In this version of the patch series, though, we don't create
entries for CPUs or memory regions that aren't enabled, so the flag isn't
actually needed.  I'll remove it for v3.

> >   mem_tgt2/is_isolated:0

This surfaces a flag in the HMAT's Memory Subsystem Address Range Structure:

  Bit [2]: Reservation hint: if set to 1, it is recommended
  that the operating system avoid placing allocations in
  this region if it cannot relocate (e.g. OS core memory
  management structures, OS core executable). Any
  allocations placed here should be able to be relocated
  (e.g. disk cache) if the memory is needed for another
  purpose.

Adding kernel support for this hint (i.e. actually reserving the memory region
during boot so it isn't used by the kernel or userspace, and is fully
available for explicit allocation) is part of the future work that we'd do in
follow-on patch series.
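
A very rough sketch of the shape that kernel support could take, assuming
it builds on the existing memblock allocator (hmat_range_is_isolated() is a
hypothetical helper, not something in this series):

  #include <linux/init.h>
  #include <linux/memblock.h>

  /* Called early in boot, before the page allocator takes over. */
  static void __init hmem_reserve_isolated(phys_addr_t base, phys_addr_t size)
  {
          if (!hmat_range_is_isolated(base, size))  /* hypothetical check */
                  return;

          /* Keep the range out of general kernel/user allocations so a
           * driver can hand it out later for explicit allocation. */
          memblock_reserve(base, size);
  }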

> >   mem_tgt2/phys_addr_base:0x0
> >   mem_tgt2/phys_length_bytes:0x800000000
> >   mem_tgt2/local_init/read_bw_MBps:30720
> >   mem_tgt2/local_init/read_lat_nsec:100
> >   mem_tgt2/local_init/write_bw_MBps:30720
> >   mem_tgt2/local_init/write_lat_nsec:100
> 
> How do these numbers compare to normal system memory?

These are garbage numbers that I made up in my hacked-up QEMU target. :)  
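
Just to show how an application might eventually consume these attributes,
though, here's a userspace sketch that scans all the targets and picks the
one advertising the highest local read bandwidth (again, the
/sys/devices/system/hmem location is my assumption):

  #include <glob.h>
  #include <stdio.h>

  /* Read a single decimal sysfs attribute; returns -1 on any failure. */
  static long read_attr(const char *dir, const char *attr)
  {
          char path[512];
          long val = -1;
          FILE *f;

          snprintf(path, sizeof(path), "%s/%s", dir, attr);
          f = fopen(path, "r");
          if (f) {
                  if (fscanf(f, "%ld", &val) != 1)
                          val = -1;
                  fclose(f);
          }
          return val;
  }

  int main(void)
  {
          long bw, best_bw = -1;
          size_t i, best = 0;
          glob_t g;

          if (glob("/sys/devices/system/hmem/mem_tgt*", 0, NULL, &g))
                  return 1;

          for (i = 0; i < g.gl_pathc; i++) {
                  bw = read_attr(g.gl_pathv[i], "local_init/read_bw_MBps");
                  if (bw > best_bw) {
                          best_bw = bw;
                          best = i;
                  }
          }
          printf("fastest target: %s (%ld MB/s)\n", g.gl_pathv[best], best_bw);
          globfree(&g);
          return 0;
  }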

> > This allows applications to easily find the memory that they want to use.
> > We expect that the existing NUMA APIs will be enhanced to use this new
> > information so that applications can continue to use them to select their
> > desired memory.
> > 
> > This series is built upon acpica-1705:
> > 
> > https://github.com/zetalog/linux/commits/acpica-1705
> > 
> > And you can find a working tree here:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs
> > 
> > ==== Lots of Details ====
> > 
> > This patch set is only concerned with CPU-addressable memory types, not
> > on-device memory like what we have with Jerome Glisse's HMM series:
> > 
> > https://lwn.net/Articles/726691/
> > 
> > This patch set works by adding support for the Heterogeneous Memory
> > Attribute Table (HMAT), newly defined in ACPI 6.2.  One major conceptual
> > change in ACPI 6.2 related to this work is that proximity domains no
> > longer need to contain a processor.  We can now have memory-only proximity
> > domains, which means that we can now have memory-only Linux NUMA nodes.
> > 
> > Here is an example configuration where we have a single processor, one
> > range of regular memory and one range of HBM:
> > 
> >   +---------------+   +----------------+
> >   | Processor     |   | Memory         |
> >   | prox domain 0 +---+ prox domain 1  |
> >   | NUMA node 1   |   | NUMA node 2    |
> >   +-------+-------+   +----------------+
> >           |
> >   +-------+----------+
> >   | HBM              |
> >   | prox domain 2    |
> >   | NUMA node 0      |
> >   +------------------+
> > 
> > This gives us one initiator (the processor) and two targets (the two memory
> > ranges).  Each of these three has its own ACPI proximity domain and
> > associated Linux NUMA node.  Note also that while there is a 1:1 mapping
> > from each proximity domain to each NUMA node, the numbers don't necessarily
> > match up.  Additionally we can have extra NUMA nodes that don't map back to
> > ACPI proximity domains.
> 
> Could you expand on proximity domains? Are they the same as node distance,
> or is this ACPI terminology for something more?

I think I answered this above in my explanation of the "firmware_id" field,
but please let me know if you have any more questions.  Basically, a
proximity domain is an ACPI concept that is very similar to a Linux NUMA
node, and every ACPI proximity domain maps to its own unique Linux NUMA
node.
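
For reference, in-kernel that mapping is maintained by the ACPI NUMA code.
A minimal sketch of how a driver gets from a proximity domain to a NUMA
node, using the existing pxm_to_node()/acpi_map_pxm_to_node() helpers (and
assuming CONFIG_ACPI_NUMA):

  #include <linux/acpi.h>
  #include <linux/numa.h>

  static int hmem_pxm_to_nid(int pxm)
  {
          int nid = pxm_to_node(pxm);  /* mapping set up during SRAT parse */

          if (nid == NUMA_NO_NODE)
                  nid = acpi_map_pxm_to_node(pxm);  /* create one if needed */

          return nid;
  }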
