linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Brice Goglin <brice.goglin@gmail.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Michal Hocko <mhocko@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Anaczkowski, Lukasz" <lukasz.anaczkowski@intel.com>,
	"Box, David E" <david.e.box@intel.com>,
	"Kogut, Jaroslaw" <Jaroslaw.Kogut@intel.com>,
	"Koss, Marcin" <marcin.koss@intel.com>,
	"Koziej, Artur" <artur.koziej@intel.com>,
	"Lahtinen, Joonas" <joonas.lahtinen@intel.com>,
	"Moore, Robert" <robert.moore@intel.com>,
	"Nachimuthu, Murugasamy" <murugasamy.nachimuthu@intel.com>,
	"Odzioba, Lukasz" <lukasz.odzioba@intel.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	"Schmauss, Erik" <erik.schmauss@intel.com>,
	"Verma, Vishal L" <vishal.l.verma@intel.com>,
	"Zheng, Lv" <lv.zheng@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Balbir Singh <bsingharora@gmail.com>,
	Jerome Glisse <jglisse@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>, Len Brown <lenb@kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	devel@acpica.org, Linux ACPI <linux-acpi@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	Linux API <linux-api@vger.kernel.org>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Date: Wed, 27 Dec 2017 10:10:34 +0100	[thread overview]
Message-ID: <71317994-af66-a1b2-4c7a-86a03253cf62@gmail.com> (raw)
In-Reply-To: <CAPcyv4j9shdJFrvADa=qW4L-jPJJ4S_TJc_c=aRoW3EmSCCChQ@mail.gmail.com>

Le 22/12/2017 à 23:53, Dan Williams a écrit :
> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin <brice.goglin@gmail.com> wrote:
>> Le 20/12/2017 à 23:41, Ross Zwisler a écrit :
> [..]
>> Hello
>>
>> I can confirm that HPC runtimes are going to use these patches (at least
>> all runtimes that use hwloc for topology discovery, but that's the vast
>> majority of HPC anyway).
>>
>> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>> explicitly detect that specific crazy table to find out which NUMA nodes
>> were local to which cores, and to find out which NUMA nodes were
>> HBM/MCDRAM or DDR. And then we had to hide the SLIT values to the
>> application because the reported latencies didn't match reality. Quite
>> annoying.
>>
>> With Ross' patches, we can easily get what we need:
>> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>> can only report a single local node per CPU (doesn't work for KNL and
>> upcoming architectures with HBM+DDR+...)
>> * which NUMA nodes are slow/fast (for both bandwidth and latency)
>> And we can still look at SLIT under /sys/devices/system/node if really
>> needed.
>>
>> And of course having this in sysfs is much better than parsing ACPI
>> tables that are only accessible to root :)
> On this point, it's not clear to me that we should allow these sysfs
> entries to be world readable. Given /proc/iomem now hides physical
> address information from non-root we at least need to be careful not
> to undo that with new sysfs HMAT attributes. Once you need to be root
> for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
> use case?

I don't think it would be a blocker.

> Perhaps we can enlist /proc/iomem or a similar enumeration interface
> to tell userspace the NUMA node and whether the kernel thinks it has
> better or worse performance characteristics relative to base
> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> publishing absolute numbers in sysfs userspace will default to looking
> for specific magic numbers in sysfs vs asking the kernel for memory
> that has performance characteristics relative to base "System RAM". In
> other words the absolute performance information that the HMAT
> publishes is useful to the kernel, but it's not clear that userspace
> needs that vs a relative indicator for making NUMA node preference
> decisions.

Some HPC users will benchmark the machine to discovery actual
performance numbers anyway.
However, most users won't do this. They will want to know relative
performance of different nodes. If you normalize HMAT values by dividing
them with system-RAM values, that's likely OK. If you just say "that
node is faster than system RAM", it's not precise enough.

Brice

  parent reply	other threads:[~2017-12-27  9:10 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20171214021019.13579-1-ross.zwisler@linux.intel.com>
     [not found] ` <20171214130032.GK16951@dhcp22.suse.cz>
     [not found]   ` <20171218203547.GA2366@linux.intel.com>
2017-12-20 18:19     ` [PATCH v3 0/3] create sysfs representation of ACPI HMAT Matthew Wilcox
2017-12-20 20:22       ` Dave Hansen
2017-12-20 21:16         ` Matthew Wilcox
2017-12-20 21:24           ` Ross Zwisler
2017-12-20 22:29             ` Dan Williams
2017-12-20 22:41               ` Ross Zwisler
2017-12-21 20:31                 ` Brice Goglin
2017-12-22 22:53                   ` Dan Williams
2017-12-22 23:22                     ` Ross Zwisler
2017-12-22 23:57                       ` Dan Williams
2017-12-23  1:14                         ` Rafael J. Wysocki
2017-12-27  9:10                     ` Brice Goglin [this message]
2017-12-30  6:58                       ` Matthew Wilcox
2017-12-30  9:19                         ` Brice Goglin
2017-12-20 21:13       ` Ross Zwisler
2017-12-21  1:41         ` Elliott, Robert (Persistent Memory)
2017-12-22 21:46           ` Ross Zwisler
2017-12-21 12:50       ` Michael Ellerman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=71317994-af66-a1b2-4c7a-86a03253cf62@gmail.com \
    --to=brice.goglin@gmail.com \
    --cc=Jaroslaw.Kogut@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=artur.koziej@intel.com \
    --cc=bsingharora@gmail.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=david.e.box@intel.com \
    --cc=devel@acpica.org \
    --cc=erik.schmauss@intel.com \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=joonas.lahtinen@intel.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lukasz.anaczkowski@intel.com \
    --cc=lukasz.odzioba@intel.com \
    --cc=lv.zheng@intel.com \
    --cc=marcin.koss@intel.com \
    --cc=mhocko@kernel.org \
    --cc=murugasamy.nachimuthu@intel.com \
    --cc=rafael.j.wysocki@intel.com \
    --cc=rjw@rjwysocki.net \
    --cc=robert.moore@intel.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=tim.c.chen@linux.intel.com \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).