* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 18:19 ` [PATCH v3 0/3] create sysfs representation of ACPI HMAT Matthew Wilcox
@ 2017-12-20 20:22 ` Dave Hansen
2017-12-20 21:16 ` Matthew Wilcox
2017-12-20 21:13 ` Ross Zwisler
2017-12-21 12:50 ` Michael Ellerman
2 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2017-12-20 20:22 UTC (permalink / raw)
To: Matthew Wilcox, Ross Zwisler
Cc: Michal Hocko, linux-kernel, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Brice Goglin, Dan Williams, Jerome Glisse, John Hubbard,
Len Brown, Tim Chen, devel, linux-acpi, linux-mm, linux-nvdimm,
linux-api, linuxppc-dev
On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories. Perhaps this is the
> right place to expose write_bw (etc).
Those directories are already too redundant and wasteful. I think we'd
really rather not add to them. In addition, it's technically possible
to have a memory section span NUMA nodes and have different performance
properties, which make it impossible to represent there.
In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
uniform performance properties in the HMAT, and we just so happen to
always create one NUMA node per PXM. So, NUMA nodes really are a good fit.
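To put a number on "redundant and wasteful": there is one memoryN directory per
memory block, so even a machine with a modest amount of RAM has hundreds of them
per node. The figures below are invented purely for illustration:
  # cat /sys/devices/system/memory/block_size_bytes
  8000000
  # ls -d /sys/devices/system/memory/memory* | wc -l
  769
The node directories, by contrast, map one-to-one onto the proximity domains
that the HMAT actually describes.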
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 20:22 ` Dave Hansen
@ 2017-12-20 21:16 ` Matthew Wilcox
2017-12-20 21:24 ` Ross Zwisler
0 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2017-12-20 21:16 UTC (permalink / raw)
To: Dave Hansen
Cc: Ross Zwisler, Michal Hocko, linux-kernel, Anaczkowski, Lukasz,
Box, David E, Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur,
Lahtinen, Joonas, Moore, Robert, Nachimuthu, Murugasamy,
Odzioba, Lukasz, Rafael J. Wysocki, Rafael J. Wysocki,
Schmauss, Erik, Verma, Vishal L, Zheng, Lv, Andrew Morton,
Balbir Singh, Brice Goglin, Dan Williams, Jerome Glisse,
John Hubbard, Len Brown, Tim Chen, devel, linux-acpi, linux-mm,
linux-nvdimm, linux-api, linuxppc-dev
On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
> On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> > I don't know what the right interface is, but my laptop has a set of
> > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the
> > right place to expose write_bw (etc).
>
> Those directories are already too redundant and wasteful. I think we'd
> really rather not add to them. In addition, it's technically possible
> to have a memory section span NUMA nodes and have different performance
> properties, which make it impossible to represent there.
>
> In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
> uniform performance properties in the HMAT, and we just so happen to
> always create one NUMA node per PXM. So, NUMA nodes really are a good fit.
I think you're missing my larger point which is that I don't think this
should be exposed to userspace as an ACPI feature. Because if you do,
then it'll also be exposed to userspace as an openfirmware feature.
And sooner or later a devicetree feature. And then writing a portable
program becomes an exercise in suffering.
So, what's the right place in sysfs that isn't tied to ACPI? A new
directory or set of directories under /sys/devices/system/memory/ ?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 21:16 ` Matthew Wilcox
@ 2017-12-20 21:24 ` Ross Zwisler
2017-12-20 22:29 ` Dan Williams
0 siblings, 1 reply; 18+ messages in thread
From: Ross Zwisler @ 2017-12-20 21:24 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Dave Hansen, Ross Zwisler, Michal Hocko, linux-kernel,
Anaczkowski, Lukasz, Box, David E, Kogut, Jaroslaw, Koss, Marcin,
Koziej, Artur, Lahtinen, Joonas, Moore, Robert,
Nachimuthu, Murugasamy, Odzioba, Lukasz, Rafael J. Wysocki,
Rafael J. Wysocki, Schmauss, Erik, Verma, Vishal L, Zheng, Lv,
Andrew Morton, Balbir Singh, Brice Goglin, Dan Williams,
Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel,
linux-acpi, linux-mm, linux-nvdimm, linux-api, linuxppc-dev
On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
> > On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> > > I don't know what the right interface is, but my laptop has a set of
> > > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the
> > > right place to expose write_bw (etc).
> >
> > Those directories are already too redundant and wasteful. I think we'd
> > really rather not add to them. In addition, it's technically possible
> > to have a memory section span NUMA nodes and have different performance
> > properties, which make it impossible to represent there.
> >
> > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
> > uniform performance properties in the HMAT, and we just so happen to
> > always create one NUMA node per PXM. So, NUMA nodes really are a good fit.
>
> I think you're missing my larger point which is that I don't think this
> should be exposed to userspace as an ACPI feature. Because if you do,
> then it'll also be exposed to userspace as an openfirmware feature.
> And sooner or later a devicetree feature. And then writing a portable
> program becomes an exercise in suffering.
>
> So, what's the right place in sysfs that isn't tied to ACPI? A new
> directory or set of directories under /sys/devices/system/memory/ ?
Oh, the current location isn't at all tied to acpi except that it happens to
be named 'hmat'. When it was all named 'hmem' it was just:
/sys/devices/system/hmem
Which has no ACPI-isms at all. I'm happy to move it under
/sys/devices/system/memory/hmat if that's helpful, but I think we still have
the issue that the data represented therein is still pulled right from the
HMAT, and I don't know how to abstract it into something more platform
agnostic until I know what data is provided by those other platforms.
For example, the HMAT provides latency information and bandwidth information
for both reads and writes. Will the devicetree/openfirmware/etc version have
this same info, or will it be just different enough that it won't translate
into whatever I choose to stick in sysfs?
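Concretely, per local initiator/target pair that means four numbers along these
lines (values made up as usual, and don't hold me to the exact latency attribute
spellings, they're just a sketch):
  # grep . /sys/devices/system/hmat/mem_tgt2/local_init/{read,write}_{bw_MBps,lat_nsec}
  /sys/devices/system/hmat/mem_tgt2/local_init/read_bw_MBps:81920
  /sys/devices/system/hmat/mem_tgt2/local_init/read_lat_nsec:100
  /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
  /sys/devices/system/hmat/mem_tgt2/local_init/write_lat_nsec:100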
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 21:24 ` Ross Zwisler
@ 2017-12-20 22:29 ` Dan Williams
2017-12-20 22:41 ` Ross Zwisler
0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2017-12-20 22:29 UTC (permalink / raw)
To: Ross Zwisler
Cc: Matthew Wilcox, Dave Hansen, Michal Hocko,
linux-kernel@vger.kernel.org, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Brice Goglin, Jerome Glisse, John Hubbard, Len Brown, Tim Chen,
devel, Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
>> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
>> > On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
>> > > I don't know what the right interface is, but my laptop has a set of
>> > > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the
>> > > right place to expose write_bw (etc).
>> >
>> > Those directories are already too redundant and wasteful. I think we'd
>> > really rather not add to them. In addition, it's technically possible
>> > to have a memory section span NUMA nodes and have different performance
>> > properties, which make it impossible to represent there.
>> >
>> > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
>> > uniform performance properties in the HMAT, and we just so happen to
>> > always create one NUMA node per PXM. So, NUMA nodes really are a good fit.
>>
>> I think you're missing my larger point which is that I don't think this
>> should be exposed to userspace as an ACPI feature. Because if you do,
>> then it'll also be exposed to userspace as an openfirmware feature.
>> And sooner or later a devicetree feature. And then writing a portable
>> program becomes an exercise in suffering.
>>
>> So, what's the right place in sysfs that isn't tied to ACPI? A new
>> directory or set of directories under /sys/devices/system/memory/ ?
>
> Oh, the current location isn't at all tied to acpi except that it happens to
> be named 'hmat'. When it was all named 'hmem' it was just:
>
> /sys/devices/system/hmem
>
> Which has no ACPI-isms at all. I'm happy to move it under
> /sys/devices/system/memory/hmat if that's helpful, but I think we still have
> the issue that the data represented therein is still pulled right from the
> HMAT, and I don't know how to abstract it into something more platform
> agnostic until I know what data is provided by those other platforms.
>
> For example, the HMAT provides latency information and bandwidth information
> for both reads and writes. Will the devicetree/openfirmware/etc version have
> this same info, or will it be just different enough that it won't translate
> into whatever I choose to stick in sysfs?
For the initial implementation do we need to have a representation of
all the performance data? Given that
/sys/devices/system/node/nodeX/distance is the only generic
performance attribute published by the kernel today it is already the
case that applications that need to target specific memories need to
go parse information that is not provided by the kernel by default.
The question is can those specialized applications stay special and go
parse the platform specific data sources, like raw HMAT, directly, or
do we expect general purpose applications to make use of this data? I
think a firmware-id to numa-node translation facility
(/sys/devices/system/node/nodeX/fwid) is a simple start that we can
build on with more information as specific use cases arise.
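Concretely, I'm picturing something as small as this, where distance already
exists today, fwid is the new piece being proposed, and the values are of course
made up:
  # cat /sys/devices/system/node/node2/distance
  21 21 10 28
  # cat /sys/devices/system/node/node2/fwid
  3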
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 22:29 ` Dan Williams
@ 2017-12-20 22:41 ` Ross Zwisler
2017-12-21 20:31 ` Brice Goglin
0 siblings, 1 reply; 18+ messages in thread
From: Ross Zwisler @ 2017-12-20 22:41 UTC (permalink / raw)
To: Dan Williams
Cc: Ross Zwisler, Matthew Wilcox, Dave Hansen, Michal Hocko,
linux-kernel@vger.kernel.org, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Brice Goglin, Jerome Glisse, John Hubbard, Len Brown, Tim Chen,
devel, Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On Wed, Dec 20, 2017 at 02:29:56PM -0800, Dan Williams wrote:
> On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
> >> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
> >> > On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> >> > > I don't know what the right interface is, but my laptop has a set of
> >> > > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the
> >> > > right place to expose write_bw (etc).
> >> >
> >> > Those directories are already too redundant and wasteful. I think we'd
> >> > really rather not add to them. In addition, it's technically possible
> >> > to have a memory section span NUMA nodes and have different performance
> >> > properties, which make it impossible to represent there.
> >> >
> >> > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
> >> > uniform performance properties in the HMAT, and we just so happen to
> >> > always create one NUMA node per PXM. So, NUMA nodes really are a good fit.
> >>
> >> I think you're missing my larger point which is that I don't think this
> >> should be exposed to userspace as an ACPI feature. Because if you do,
> >> then it'll also be exposed to userspace as an openfirmware feature.
> >> And sooner or later a devicetree feature. And then writing a portable
> >> program becomes an exercise in suffering.
> >>
> >> So, what's the right place in sysfs that isn't tied to ACPI? A new
> >> directory or set of directories under /sys/devices/system/memory/ ?
> >
> > Oh, the current location isn't at all tied to acpi except that it happens to
> > be named 'hmat'. When it was all named 'hmem' it was just:
> >
> > /sys/devices/system/hmem
> >
> > Which has no ACPI-isms at all. I'm happy to move it under
> > /sys/devices/system/memory/hmat if that's helpful, but I think we still have
> > the issue that the data represented therein is still pulled right from the
> > HMAT, and I don't know how to abstract it into something more platform
> > agnostic until I know what data is provided by those other platforms.
> >
> > For example, the HMAT provides latency information and bandwidth information
> > for both reads and writes. Will the devicetree/openfirmware/etc version have
> > this same info, or will it be just different enough that it won't translate
> > into whatever I choose to stick in sysfs?
>
> For the initial implementation do we need to have a representation of
> all the performance data? Given that
> /sys/devices/system/node/nodeX/distance is the only generic
> performance attribute published by the kernel today it is already the
> case that applications that need to target specific memories need to
> go parse information that is not provided by the kernel by default.
> The question is can those specialized applications stay special and go
> parse the platform specific data sources, like raw HMAT, directly, or
> do we expect general purpose applications to make use of this data? I
> think a firmware-id to numa-node translation facility
> (/sys/devices/system/node/nodeX/fwid) is a simple start that we can
> build on with more information as specific use cases arise.
We don't represent all the performance data, we only represent the data for
local initiator/target pairs. I do think that this is useful to have in sysfs
because it provides a way to easily answer the most commonly asked questions
(or at least what I'm guessing will be the most commonly asked questions),
i.e. "given a CPU, what are the speeds of the various types of memory attached
to it", and "given a chunk of memory, how fast is it and to which CPU is it
local"? By providing this base level of information I'm hoping to prevent
most applications from having to parse the HMAT directly.
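To make that concrete: with the layout in this series, "which memory targets are
local to CPU node 0, and how fast are they" is just a couple of globs (same
made-up paths and numbers as in the cover letter):
  # ls -d /sys/devices/system/hmat/mem_tgt*/local_init/mem_init0
  /sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
  /sys/devices/system/hmat/mem_tgt3/local_init/mem_init0
  # grep . /sys/devices/system/hmat/mem_tgt[23]/local_init/write_bw_MBps
  /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
  /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960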
The question of whether or not to include this local performance information
was one of the main questions of the initial RFC patch series, and I did get
feedback (albeit off-list) that the local performance information was
valuable to at least some users. I did intentionally structure my (now very
short) set so that the performance information was added as a separate patch,
so we can get to the place you're talking about where we only provide firmware
id <=> proximity domain mappings by just leaving off the last patch in the
series.
I'm personally still of the opinion though that this last patch does add
value.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 22:41 ` Ross Zwisler
@ 2017-12-21 20:31 ` Brice Goglin
2017-12-22 22:53 ` Dan Williams
0 siblings, 1 reply; 18+ messages in thread
From: Brice Goglin @ 2017-12-21 20:31 UTC (permalink / raw)
To: Ross Zwisler, Dan Williams
Cc: Matthew Wilcox, Dave Hansen, Michal Hocko,
linux-kernel@vger.kernel.org, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel,
Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On 20/12/2017 at 23:41, Ross Zwisler wrote:
> On Wed, Dec 20, 2017 at 02:29:56PM -0800, Dan Williams wrote:
>> On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>>> On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
>>>> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
>>>>> On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
>>>>>> I don't know what the right interface is, but my laptop has a set of
>>>>>> /sys/devices/system/memory/memoryN/ directories. Perhaps this is the
>>>>>> right place to expose write_bw (etc).
>>>>> Those directories are already too redundant and wasteful. I think we'd
>>>>> really rather not add to them. In addition, it's technically possible
>>>>> to have a memory section span NUMA nodes and have different performance
>>>>> properties, which make it impossible to represent there.
>>>>>
>>>>> In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
>>>>> uniform performance properties in the HMAT, and we just so happen to
>>>>> always create one NUMA node per PXM. So, NUMA nodes really are a good fit.
>>>> I think you're missing my larger point which is that I don't think this
>>>> should be exposed to userspace as an ACPI feature. Because if you do,
>>>> then it'll also be exposed to userspace as an openfirmware feature.
>>>> And sooner or later a devicetree feature. And then writing a portable
>>>> program becomes an exercise in suffering.
>>>>
>>>> So, what's the right place in sysfs that isn't tied to ACPI? A new
>>>> directory or set of directories under /sys/devices/system/memory/ ?
>>> Oh, the current location isn't at all tied to acpi except that it happens to
>>> be named 'hmat'. When it was all named 'hmem' it was just:
>>>
>>> /sys/devices/system/hmem
>>>
>>> Which has no ACPI-isms at all. I'm happy to move it under
>>> /sys/devices/system/memory/hmat if that's helpful, but I think we still have
>>> the issue that the data represented therein is still pulled right from the
>>> HMAT, and I don't know how to abstract it into something more platform
>>> agnostic until I know what data is provided by those other platforms.
>>>
>>> For example, the HMAT provides latency information and bandwidth information
>>> for both reads and writes. Will the devicetree/openfirmware/etc version have
>>> this same info, or will it be just different enough that it won't translate
>>> into whatever I choose to stick in sysfs?
>> For the initial implementation do we need to have a representation of
>> all the performance data? Given that
>> /sys/devices/system/node/nodeX/distance is the only generic
>> performance attribute published by the kernel today it is already the
>> case that applications that need to target specific memories need to
>> go parse information that is not provided by the kernel by default.
>> The question is can those specialized applications stay special and go
>> parse the platform specific data sources, like raw HMAT, directly, or
>> do we expect general purpose applications to make use of this data? I
>> think a firmware-id to numa-node translation facility
>> (/sys/devices/system/node/nodeX/fwid) is a simple start that we can
>> build on with more information as specific use cases arise.
> We don't represent all the performance data, we only represent the data for
> local initiator/target pairs. I do think that this is useful to have in sysfs
> because it provides a way to easily answer the most commonly asked questions
>> (or at least what I'm guessing will be the most commonly asked questions),
> i.e. "given a CPU, what are the speeds of the various types of memory attached
> to it", and "given a chunk of memory, how fast is it and to which CPU is it
> local"? By providing this base level of information I'm hoping to prevent
> most applications from having to parse the HMAT directly.
>
> The question of whether or not to include this local performance information
> was one of the main questions of the initial RFC patch series, and I did get
>> feedback (albeit off-list) that the local performance information was
> valuable to at least some users. I did intentionally structure my (now very
> short) set so that the performance information was added as a separate patch,
> so we can get to the place you're talking about where we only provide firmware
> id <=> proximity domain mappings by just leaving off the last patch in the
> series.
>
Hello
I can confirm that HPC runtimes are going to use these patches (at least
all runtimes that use hwloc for topology discovery, but that's the vast
majority of HPC anyway).
We really didn't like KNL exposing a hacky SLIT table [1]. We had to
explicitly detect that specific crazy table to find out which NUMA nodes
were local to which cores, and to find out which NUMA nodes were
HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
application because the reported latencies didn't match reality. Quite
annoying.
With Ross' patches, we can easily get what we need:
* which NUMA nodes are local to which CPUs? /sys/devices/system/node/
can only report a single local node per CPU (doesn't work for KNL and
upcoming architectures with HBM+DDR+...)
* which NUMA nodes are slow/fast (for both bandwidth and latency)
And we can still look at SLIT under /sys/devices/system/node if really
needed.
And of course having this in sysfs is much better than parsing ACPI
tables that are only accessible to root :)
Regards
Brice
[1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
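To illustrate [1] via the node distance files: on a box with two DDR+core nodes
and two MCDRAM nodes, the rows for the core nodes would read something like the
following (reconstructed from the values above, not copied from a real machine):
  $ cat /sys/devices/system/node/node0/distance
  10 20 31 41
  $ cat /sys/devices/system/node/node1/distance
  20 10 41 31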
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-21 20:31 ` Brice Goglin
@ 2017-12-22 22:53 ` Dan Williams
2017-12-22 23:22 ` Ross Zwisler
2017-12-27 9:10 ` Brice Goglin
0 siblings, 2 replies; 18+ messages in thread
From: Dan Williams @ 2017-12-22 22:53 UTC (permalink / raw)
To: Brice Goglin
Cc: Ross Zwisler, Matthew Wilcox, Dave Hansen, Michal Hocko,
linux-kernel@vger.kernel.org, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel,
Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin <brice.goglin@gmail.com> wrote:
> On 20/12/2017 at 23:41, Ross Zwisler wrote:
[..]
> Hello
>
> I can confirm that HPC runtimes are going to use these patches (at least
> all runtimes that use hwloc for topology discovery, but that's the vast
> majority of HPC anyway).
>
> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
> explicitly detect that specific crazy table to find out which NUMA nodes
> were local to which cores, and to find out which NUMA nodes were
> HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
> application because the reported latencies didn't match reality. Quite
> annoying.
>
> With Ross' patches, we can easily get what we need:
> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
> can only report a single local node per CPU (doesn't work for KNL and
> upcoming architectures with HBM+DDR+...)
> * which NUMA nodes are slow/fast (for both bandwidth and latency)
> And we can still look at SLIT under /sys/devices/system/node if really
> needed.
>
> And of course having this in sysfs is much better than parsing ACPI
> tables that are only accessible to root :)
On this point, it's not clear to me that we should allow these sysfs
entries to be world readable. Given /proc/iomem now hides physical
address information from non-root we at least need to be careful not
to undo that with new sysfs HMAT attributes. Once you need to be root
for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
use case?
Perhaps we can enlist /proc/iomem or a similar enumeration interface
to tell userspace the NUMA node and whether the kernel thinks it has
better or worse performance characteristics relative to base
system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
publishing absolute numbers in sysfs userspace will default to looking
for specific magic numbers in sysfs vs asking the kernel for memory
that has performance characteristics relative to base "System RAM". In
other words the absolute performance information that the HMAT
publishes is useful to the kernel, but it's not clear that userspace
needs that vs a relative indicator for making NUMA node preference
decisions.
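That hiding is easy to see today: without CAP_SYS_ADMIN every range in
/proc/iomem reads back zeroed, roughly like this (the unprivileged run is the $
prompt, root is #, and the root-side addresses are only an example):
  $ grep "System RAM" /proc/iomem | head -1
  00000000-00000000 : System RAM
  # grep "System RAM" /proc/iomem | head -1
  00001000-0009fbff : System RAM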
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-22 22:53 ` Dan Williams
@ 2017-12-22 23:22 ` Ross Zwisler
2017-12-22 23:57 ` Dan Williams
2017-12-27 9:10 ` Brice Goglin
1 sibling, 1 reply; 18+ messages in thread
From: Ross Zwisler @ 2017-12-22 23:22 UTC (permalink / raw)
To: Dan Williams
Cc: Brice Goglin, Ross Zwisler, Matthew Wilcox, Dave Hansen,
Michal Hocko, linux-kernel@vger.kernel.org, Anaczkowski, Lukasz,
Box, David E, Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur,
Lahtinen, Joonas, Moore, Robert, Nachimuthu, Murugasamy,
Odzioba, Lukasz, Rafael J. Wysocki, Rafael J. Wysocki,
Schmauss, Erik, Verma, Vishal L, Zheng, Lv, Andrew Morton,
Balbir Singh, Jerome Glisse, John Hubbard, Len Brown, Tim Chen,
devel, Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin <brice.goglin@gmail.com> wrote:
> > On 20/12/2017 at 23:41, Ross Zwisler wrote:
> [..]
> > Hello
> >
> > I can confirm that HPC runtimes are going to use these patches (at least
> > all runtimes that use hwloc for topology discovery, but that's the vast
> > majority of HPC anyway).
> >
> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
> > explicitly detect that specific crazy table to find out which NUMA nodes
> > were local to which cores, and to find out which NUMA nodes were
> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
> > application because the reported latencies didn't match reality. Quite
> > annoying.
> >
> > With Ross' patches, we can easily get what we need:
> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
> > can only report a single local node per CPU (doesn't work for KNL and
> > upcoming architectures with HBM+DDR+...)
> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
> > And we can still look at SLIT under /sys/devices/system/node if really
> > needed.
> >
> > And of course having this in sysfs is much better than parsing ACPI
> > tables that are only accessible to root :)
>
> On this point, it's not clear to me that we should allow these sysfs
> entries to be world readable. Given /proc/iomem now hides physical
> address information from non-root we at least need to be careful not
> to undo that with new sysfs HMAT attributes.
This enabling does not expose any physical addresses to userspace. It only
provides performance numbers from the HMAT and associates them with existing
NUMA nodes. Are you worried that exposing performance numbers to non-root
users via sysfs poses a security risk?
> Once you need to be root for this info, is parsing binary HMAT vs sysfs a
> blocker for the HPC use case?
>
> Perhaps we can enlist /proc/iomem or a similar enumeration interface
> to tell userspace the NUMA node and whether the kernel thinks it has
> better or worse performance characteristics relative to base
> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> publishing absolute numbers in sysfs userspace will default to looking
> for specific magic numbers in sysfs vs asking the kernel for memory
> that has performance characteristics relative to base "System RAM". In
> other words the absolute performance information that the HMAT
> publishes is useful to the kernel, but it's not clear that userspace
> needs that vs a relative indicator for making NUMA node preference
> decisions.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-22 23:22 ` Ross Zwisler
@ 2017-12-22 23:57 ` Dan Williams
2017-12-23 1:14 ` Rafael J. Wysocki
0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2017-12-22 23:57 UTC (permalink / raw)
To: Ross Zwisler
Cc: Brice Goglin, Matthew Wilcox, Dave Hansen, Michal Hocko,
linux-kernel@vger.kernel.org, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel,
Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On Fri, Dec 22, 2017 at 3:22 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
>> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin <brice.goglin@gmail.com> wrote:
>> > On 20/12/2017 at 23:41, Ross Zwisler wrote:
>> [..]
>> > Hello
>> >
>> > I can confirm that HPC runtimes are going to use these patches (at least
>> > all runtimes that use hwloc for topology discovery, but that's the vast
>> > majority of HPC anyway).
>> >
>> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>> > explicitly detect that specific crazy table to find out which NUMA nodes
>> > were local to which cores, and to find out which NUMA nodes were
>> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
>> > application because the reported latencies didn't match reality. Quite
>> > annoying.
>> >
>> > With Ross' patches, we can easily get what we need:
>> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>> > can only report a single local node per CPU (doesn't work for KNL and
>> > upcoming architectures with HBM+DDR+...)
>> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
>> > And we can still look at SLIT under /sys/devices/system/node if really
>> > needed.
>> >
>> > And of course having this in sysfs is much better than parsing ACPI
>> > tables that are only accessible to root :)
>>
>> On this point, it's not clear to me that we should allow these sysfs
>> entries to be world readable. Given /proc/iomem now hides physical
>> address information from non-root we at least need to be careful not
>> to undo that with new sysfs HMAT attributes.
>
> This enabling does not expose any physical addresses to userspace. It only
> provides performance numbers from the HMAT and associates them with existing
> NUMA nodes. Are you worried that exposing performance numbers to non-root
> users via sysfs poses a security risk?
It's an information disclosure that's not clear we need to make to
non-root processes.
I'm more worried about userspace growing dependencies on the absolute
numbers when those numbers can change from platform to platform.
Differentiated memory on one platform may be the common memory pool on
another.
To me this has parallels with storage device hinting where
specifications like T10 have a complex enumeration of all the
performance hints that can be passed to the device, but the Linux
enabling effort aims for a sanitized set of relative hints that make
sense. It's more flexible if userspace specifies a relative intent
rather than an absolute performance target. Putting all the HMAT
information into sysfs gives userspace more information than it could
possibly do anything reasonable with, at least outside of specialized apps
that are hand tuned for a given hardware platform.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-22 23:57 ` Dan Williams
@ 2017-12-23 1:14 ` Rafael J. Wysocki
0 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2017-12-23 1:14 UTC (permalink / raw)
To: Dan Williams
Cc: Ross Zwisler, Brice Goglin, Matthew Wilcox, Dave Hansen,
Michal Hocko, linux-kernel@vger.kernel.org, Anaczkowski, Lukasz,
Box, David E, Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur,
Lahtinen, Joonas, Moore, Robert, Nachimuthu, Murugasamy,
Odzioba, Lukasz, Rafael J. Wysocki, Rafael J. Wysocki,
Schmauss, Erik, Verma, Vishal L, Zheng, Lv, Andrew Morton,
Balbir Singh, Jerome Glisse, John Hubbard, Len Brown, Tim Chen,
devel, Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On Sat, Dec 23, 2017 at 12:57 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Dec 22, 2017 at 3:22 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
>> On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
>>> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin <brice.goglin@gmail.com> wrote:
>>> > On 20/12/2017 at 23:41, Ross Zwisler wrote:
>>> [..]
>>> > Hello
>>> >
>>> > I can confirm that HPC runtimes are going to use these patches (at least
>>> > all runtimes that use hwloc for topology discovery, but that's the vast
>>> > majority of HPC anyway).
>>> >
>>> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>>> > explicitly detect that specific crazy table to find out which NUMA nodes
>>> > were local to which cores, and to find out which NUMA nodes were
>>> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
>>> > application because the reported latencies didn't match reality. Quite
>>> > annoying.
>>> >
>>> > With Ross' patches, we can easily get what we need:
>>> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>>> > can only report a single local node per CPU (doesn't work for KNL and
>>> > upcoming architectures with HBM+DDR+...)
>>> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
>>> > And we can still look at SLIT under /sys/devices/system/node if really
>>> > needed.
>>> >
>>> > And of course having this in sysfs is much better than parsing ACPI
>>> > tables that are only accessible to root :)
>>>
>>> On this point, it's not clear to me that we should allow these sysfs
>>> entries to be world readable. Given /proc/iomem now hides physical
>>> address information from non-root we at least need to be careful not
>>> to undo that with new sysfs HMAT attributes.
>>
>> This enabling does not expose any physical addresses to userspace. It only
>> provides performance numbers from the HMAT and associates them with existing
>> NUMA nodes. Are you worried that exposing performance numbers to non-root
>> users via sysfs poses a security risk?
>
> It's an information disclosure that's not clear we need to make to
> non-root processes.
>
> I'm more worried about userspace growing dependencies on the absolute
> numbers when those numbers can change from platform to platform.
> Differentiated memory on one platform may be the common memory pool on
> another.
>
> To me this has parallels with storage device hinting where
> specifications like T10 have a complex enumeration of all the
> performance hints that can be passed to the device, but the Linux
> enabling effort aims for a sanitized set of relative hints that make
> sense. It's more flexible if userspace specifies a relative intent
> rather than an absolute performance target. Putting all the HMAT
> information into sysfs gives userspace more information than it could
> possibly do anything reasonable with, at least outside of specialized apps
> that are hand tuned for a given hardware platform.
That's a valid point IMO.
It is sort of tempting to expose everything to user space verbatim,
especially early in the enabling process when the kernel has not yet
found suitable ways to utilize the given information, but the very act
of exposing it may affect what can be done with it in the future.
User space interfaces need to stay around and be supported forever, at
least potentially, so adding every one of them is a serious
commitment.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-22 22:53 ` Dan Williams
2017-12-22 23:22 ` Ross Zwisler
@ 2017-12-27 9:10 ` Brice Goglin
2017-12-30 6:58 ` Matthew Wilcox
1 sibling, 1 reply; 18+ messages in thread
From: Brice Goglin @ 2017-12-27 9:10 UTC (permalink / raw)
To: Dan Williams
Cc: Ross Zwisler, Matthew Wilcox, Dave Hansen, Michal Hocko,
linux-kernel@vger.kernel.org, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel,
Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On 22/12/2017 at 23:53, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin <brice.goglin@gmail.com> wrote:
>> On 20/12/2017 at 23:41, Ross Zwisler wrote:
> [..]
>> Hello
>>
>> I can confirm that HPC runtimes are going to use these patches (at least
>> all runtimes that use hwloc for topology discovery, but that's the vast
>> majority of HPC anyway).
>>
>> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>> explicitly detect that specific crazy table to find out which NUMA nodes
>> were local to which cores, and to find out which NUMA nodes were
>> HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
>> application because the reported latencies didn't match reality. Quite
>> annoying.
>>
>> With Ross' patches, we can easily get what we need:
>> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>> can only report a single local node per CPU (doesn't work for KNL and
>> upcoming architectures with HBM+DDR+...)
>> * which NUMA nodes are slow/fast (for both bandwidth and latency)
>> And we can still look at SLIT under /sys/devices/system/node if really
>> needed.
>>
>> And of course having this in sysfs is much better than parsing ACPI
>> tables that are only accessible to root :)
> On this point, it's not clear to me that we should allow these sysfs
> entries to be world readable. Given /proc/iomem now hides physical
> address information from non-root we at least need to be careful not
> to undo that with new sysfs HMAT attributes. Once you need to be root
> for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
> use case?
I don't think it would be a blocker.
> Perhaps we can enlist /proc/iomem or a similar enumeration interface
> to tell userspace the NUMA node and whether the kernel thinks it has
> better or worse performance characteristics relative to base
> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> publishing absolute numbers in sysfs userspace will default to looking
> for specific magic numbers in sysfs vs asking the kernel for memory
> that has performance characteristics relative to base "System RAM". In
> other words the absolute performance information that the HMAT
> publishes is useful to the kernel, but it's not clear that userspace
> needs that vs a relative indicator for making NUMA node preference
> decisions.
Some HPC users will benchmark the machine to discover actual
performance numbers anyway.
However, most users won't do this. They will want to know relative
performance of different nodes. If you normalize HMAT values by dividing
them by system-RAM values, that's likely OK. If you just say "that
node is faster than system RAM", it's not precise enough.
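For instance, with the made-up numbers from Ross' cover letter, where plain DDR
reports 40960, that normalization is a one-liner in userspace:
  # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps |
      awk -F: '{ printf "%s: %.1fx of DDR\n", $1, $2/40960 }'
  /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps: 2.0x of DDR
  /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps: 1.0x of DDR
  /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps: 1.0x of DDR
  /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps: 1.0x of DDR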
Brice
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-27 9:10 ` Brice Goglin
@ 2017-12-30 6:58 ` Matthew Wilcox
2017-12-30 9:19 ` Brice Goglin
0 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2017-12-30 6:58 UTC (permalink / raw)
To: Brice Goglin
Cc: Dan Williams, Ross Zwisler, Dave Hansen, Michal Hocko,
linux-kernel@vger.kernel.org, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel,
Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On Wed, Dec 27, 2017 at 10:10:34AM +0100, Brice Goglin wrote:
> > Perhaps we can enlist /proc/iomem or a similar enumeration interface
> > to tell userspace the NUMA node and whether the kernel thinks it has
> > better or worse performance characteristics relative to base
> > system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> > publishing absolute numbers in sysfs userspace will default to looking
> > for specific magic numbers in sysfs vs asking the kernel for memory
> > that has performance characteristics relative to base "System RAM". In
> > other words the absolute performance information that the HMAT
> > publishes is useful to the kernel, but it's not clear that userspace
> > needs that vs a relative indicator for making NUMA node preference
> > decisions.
>
> Some HPC users will benchmark the machine to discover actual
> performance numbers anyway.
> However, most users won't do this. They will want to know relative
> performance of different nodes. If you normalize HMAT values by dividing
> them by system-RAM values, that's likely OK. If you just say "that
> node is faster than system RAM", it's not precise enough.
So "this memory has 800% bandwidth of normal" and "this memory has 70%
bandwidth of normal"?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-30 6:58 ` Matthew Wilcox
@ 2017-12-30 9:19 ` Brice Goglin
0 siblings, 0 replies; 18+ messages in thread
From: Brice Goglin @ 2017-12-30 9:19 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Dan Williams, Ross Zwisler, Dave Hansen, Michal Hocko,
linux-kernel@vger.kernel.org, Anaczkowski, Lukasz, Box, David E,
Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur, Lahtinen, Joonas,
Moore, Robert, Nachimuthu, Murugasamy, Odzioba, Lukasz,
Rafael J. Wysocki, Rafael J. Wysocki, Schmauss, Erik,
Verma, Vishal L, Zheng, Lv, Andrew Morton, Balbir Singh,
Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel,
Linux ACPI, Linux MM, linux-nvdimm@lists.01.org, Linux API,
linuxppc-dev
On 30/12/2017 at 07:58, Matthew Wilcox wrote:
> On Wed, Dec 27, 2017 at 10:10:34AM +0100, Brice Goglin wrote:
>>> Perhaps we can enlist /proc/iomem or a similar enumeration interface
>>> to tell userspace the NUMA node and whether the kernel thinks it has
>>> better or worse performance characteristics relative to base
>>> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
>>> publishing absolute numbers in sysfs userspace will default to looking
>>> for specific magic numbers in sysfs vs asking the kernel for memory
>>> that has performance characteristics relative to base "System RAM". In
>>> other words the absolute performance information that the HMAT
>>> publishes is useful to the kernel, but it's not clear that userspace
>>> needs that vs a relative indicator for making NUMA node preference
>>> decisions.
>> Some HPC users will benchmark the machine to discover actual
>> performance numbers anyway.
>> However, most users won't do this. They will want to know relative
>> performance of different nodes. If you normalize HMAT values by dividing
>> them by system-RAM values, that's likely OK. If you just say "that
>> node is faster than system RAM", it's not precise enough.
> So "this memory has 800% bandwidth of normal" and "this memory has 70%
> bandwidth of normal"?
I guess that would work.
Brice
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 18:19 ` [PATCH v3 0/3] create sysfs representation of ACPI HMAT Matthew Wilcox
2017-12-20 20:22 ` Dave Hansen
@ 2017-12-20 21:13 ` Ross Zwisler
2017-12-21 1:41 ` Elliott, Robert (Persistent Memory)
2017-12-21 12:50 ` Michael Ellerman
2 siblings, 1 reply; 18+ messages in thread
From: Ross Zwisler @ 2017-12-20 21:13 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ross Zwisler, Michal Hocko, linux-kernel, Anaczkowski, Lukasz,
Box, David E, Kogut, Jaroslaw, Koss, Marcin, Koziej, Artur,
Lahtinen, Joonas, Moore, Robert, Nachimuthu, Murugasamy,
Odzioba, Lukasz, Rafael J. Wysocki, Rafael J. Wysocki,
Schmauss, Erik, Verma, Vishal L, Zheng, Lv, Andrew Morton,
Balbir Singh, Brice Goglin, Dan Williams, Dave Hansen,
Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel,
linux-acpi, linux-mm, linux-nvdimm, linux-api, linuxppc-dev
On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nodes to
> > select with existing utilities like numactl. This series does not currently
> > alter any kernel behavior, it only provides a sysfs interface.
> >
> > Say for example you had a system with some high bandwidth memory (HBM), and
> > you wanted to use it for a specific application. You could use the sysfs
> > representation of the HMAT to figure out which memory target held your HBM.
> > You could do this by looking at the local bandwidth values for the various
> > memory targets, so:
> >
> > # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> >
> > and look for the one that corresponds to your HBM speed. (These numbers are
> > made up, but you get the idea.)
>
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future. I think
> we need a platform-agnostic way ... right, PowerPC people?
Hey Matthew,
Yep, this is where I started as well. My plan with my initial implementation
was to try and make the sysfs representation as platform agnostic as possible,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.
However, as I began coding, the implementation became very specific to the
HMAT, probably because I don't know of a way that this type of info is
represented on another platform. John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, and to
prevent it from being confused with the HMM work:
https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442
I'm open to making it more platform agnostic if I can get my hands on a
parallel effort in another platform and tease out the commonality, but trying
to do that without a second example hasn't worked out. If we don't have a
good second example right now I think maybe we should put this in and then
merge it with the second example when it comes along.
> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories. Perhaps this is the
> right place to expose write_bw (etc).
>
> > Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> > it's local initiator:
> >
> > # ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > /sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> >
> > So, in our made-up example our HBM is located in numa node 2, and the local
> > CPU for that HBM is at numa node 0.
>
> initiator is a CPU? I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.
Yea, I agree that at first blush it seems weird. It turns out that looking at
it in sort of a storage initiator/target way is beneficial, though, because it
allows us to cut down on the number of data values we need to represent.
For example the SLIT, which doesn't differentiate between initiator and target
proximity domains (and thus nodes) always represents a system with N proximity
domains using a NxN distance table. This makes sense if every node contains
both CPUs and memory.
With the introduction of the HMAT, though, we can have memory-only target
nodes and we can explicitly associate them with their local CPU. This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) that
are all attached to the same CPU.
So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it. We now have 16 total proximity
domains, 4 CPU and 12 memory.
If we represent this with the SLIT we end up with a 16 X 16 distance table
(256 entries), most of which don't matter because they are memory-to-memory
distances which don't make sense.
In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists. (See 5.2.27.4 System Locality Latency and Bandwidth
Information Structure in ACPI 6.2 for details.) So, this same config in the
HMAT only has 4*12=48 performance values of each type, all of which convey
meaningful information.
The HMAT indeed even uses the storage "initiator" and "target" terminology. :)
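In sysfs terms (keeping in mind that the current patches only expose the local
pairs), that made-up 4 CPU + 12 memory box would show up as twelve targets, each
tied back to exactly one initiator, rather than as one 16x16 blob. Illustrative
paths only:
  # ls -d /sys/devices/system/hmat/mem_tgt* | wc -l
  12
  # ls -d /sys/devices/system/hmat/mem_tgt4/local_init/mem_init*
  /sys/devices/system/hmat/mem_tgt4/local_init/mem_init0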
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 21:13 ` Ross Zwisler
@ 2017-12-21 1:41 ` Elliott, Robert (Persistent Memory)
2017-12-22 21:46 ` Ross Zwisler
0 siblings, 1 reply; 18+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-12-21 1:41 UTC (permalink / raw)
To: Ross Zwisler, Matthew Wilcox
Cc: Michal Hocko, Box, David E, Dave Hansen, Zheng, Lv,
linux-nvdimm@lists.01.org, Rafael J. Wysocki, Anaczkowski, Lukasz,
Moore, Robert, linux-acpi@vger.kernel.org, Odzioba, Lukasz,
Schmauss, Erik, Len Brown, John Hubbard,
linuxppc-dev@lists.ozlabs.org, Jerome Glisse, devel@acpica.org,
Kogut, Jaroslaw, linux-mm@kvack.org, Koss, Marcin,
linux-api@vger.kernel.org, Brice Goglin, Nachimuthu, Murugasamy,
Rafael J. Wysocki, linux-kernel@vger.kernel.org, Koziej, Artur,
Lahtinen, Joonas, Andrew Morton, Tim Chen
> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> Ross Zwisler
...
>
> On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
...
> > initiator is a CPU? I'd have expected you to expose a memory controller
> > abstraction rather than re-use storage terminology.
>
> Yea, I agree that at first blush it seems weird. It turns out that
> looking at it in sort of a storage initiator/target way is beneficial,
> though, because it allows us to cut down on the number of data values
> we need to represent.
>
> For example the SLIT, which doesn't differentiate between initiator and
> target proximity domains (and thus nodes) always represents a system
> with N proximity domains using a NxN distance table. This makes sense
> if every node contains both CPUs and memory.
>
> With the introduction of the HMAT, though, we can have memory-only target
> nodes and we can explicitly associate them with their local
> CPU. This is necessary so that we can separate memory with different
> performance characteristics (HBM vs normal memory vs persistent memory,
> for example) that are all attached to the same CPU.
>
> So, say we now have a system with 4 CPUs, and each of those CPUs has 3
> different types of memory attached to it. We now have 16 total proximity
> domains, 4 CPU and 12 memory.
The CPU cores that make up a node can have performance restrictions of
their own; for example, they might max out at 10 GB/s even though the
memory controller supports 120 GB/s (meaning you need to use 12 cores
on the node to fully exercise memory). It'd be helpful to report this,
so software can decide how many cores to use for bandwidth-intensive work.
> If we represent this with the SLIT we end up with a 16 X 16 distance table
> (256 entries), most of which don't matter because they are memory-to-
> memory distances which don't make sense.
>
> In the HMAT, though, we separate out the initiators and the targets and
> put them into separate lists. (See 5.2.27.4 System Locality Latency and
> Bandwidth Information Structure in ACPI 6.2 for details.) So, this same
> config in the HMAT only has 4*12=48 performance values of each type, all
> of which convey meaningful information.
>
> The HMAT indeed even uses the storage "initiator" and "target"
> terminology. :)
Centralized DMA engines (e.g., as used by the "DMA based blk-mq pmem
driver") have performance differences too. A CPU might include
CPU cores that reach 10 GB/s, DMA engines that reach 60 GB/s, and
memory controllers that reach 120 GB/s. I guess these would be
represented as extra initiators on the node?
---
Robert Elliott, HPE Persistent Memory
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-21 1:41 ` Elliott, Robert (Persistent Memory)
@ 2017-12-22 21:46 ` Ross Zwisler
0 siblings, 0 replies; 18+ messages in thread
From: Ross Zwisler @ 2017-12-22 21:46 UTC (permalink / raw)
To: Elliott, Robert (Persistent Memory)
Cc: Ross Zwisler, Matthew Wilcox, Michal Hocko, Box, David E,
Dave Hansen, Zheng, Lv, linux-nvdimm@lists.01.org,
Rafael J. Wysocki, Anaczkowski, Lukasz, Moore, Robert,
linux-acpi@vger.kernel.org, Odzioba, Lukasz, Schmauss, Erik,
Len Brown, John Hubbard, linuxppc-dev@lists.ozlabs.org,
Jerome Glisse, devel@acpica.org, Kogut, Jaroslaw,
linux-mm@kvack.org, Koss, Marcin, linux-api@vger.kernel.org,
Brice Goglin, Nachimuthu, Murugasamy, Rafael J. Wysocki,
linux-kernel@vger.kernel.org, Koziej, Artur, Lahtinen, Joonas,
Andrew Morton, Tim Chen
On Thu, Dec 21, 2017 at 01:41:15AM +0000, Elliott, Robert (Persistent Memory) wrote:
>
>
> > -----Original Message-----
> > From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> > Ross Zwisler
> ...
> >
> > On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> ...
> > > initiator is a CPU? I'd have expected you to expose a memory controller
> > > abstraction rather than re-use storage terminology.
> >
> > Yea, I agree that at first blush it seems weird. It turns out that
> > looking at it in sort of a storage initiator/target way is beneficial,
> > though, because it allows us to cut down on the number of data values
> > we need to represent.
> >
> > For example the SLIT, which doesn't differentiate between initiator and
> > target proximity domains (and thus nodes) always represents a system
> > with N proximity domains using a NxN distance table. This makes sense
> > if every node contains both CPUs and memory.
> >
> > With the introduction of the HMAT, though, we can have memory-only target
> > nodes and we can explicitly associate them with their local
> > CPU. This is necessary so that we can separate memory with different
> > performance characteristics (HBM vs normal memory vs persistent memory,
> > for example) that are all attached to the same CPU.
> >
> > So, say we now have a system with 4 CPUs, and each of those CPUs has 3
> > different types of memory attached to it. We now have 16 total proximity
> > domains, 4 CPU and 12 memory.
>
> The CPU cores that make up a node can have performance restrictions of
> their own; for example, they might max out at 10 GB/s even though the
> memory controller supports 120 GB/s (meaning you need to use 12 cores
> on the node to fully exercise memory). It'd be helpful to report this,
> so software can decide how many cores to use for bandwidth-intensive work.
>
> > If we represent this with the SLIT we end up with a 16 X 16 distance table
> > (256 entries), most of which don't matter because they are memory-to-
> > memory distances which don't make sense.
> >
> > In the HMAT, though, we separate out the initiators and the targets and
> > put them into separate lists. (See 5.2.27.4 System Locality Latency and
> > Bandwidth Information Structure in ACPI 6.2 for details.) So, this same
> > config in the HMAT only has 4*12=48 performance values of each type, all
> > of which convey meaningful information.
> >
> > The HMAT indeed even uses the storage "initiator" and "target"
> > terminology. :)
>
> Centralized DMA engines (e.g., as used by the "DMA based blk-mq pmem
> driver") have performance differences too. A CPU might include
> CPU cores that reach 10 GB/s, DMA engines that reach 60 GB/s, and
> memory controllers that reach 120 GB/s. I guess these would be
> represented as extra initiators on the node?
For both of your comments I think all of this comes down to how you want to
represent your platform in the HMAT. The sysfs representation just shows you
what is in the HMAT.
Each initiator node is just a single NUMA node (think of it as a NUMA node
which has the characteristic that it can initiate memory requests), so I don't
think there is a way to have "extra initiators on the node". I think what
you're talking about is separating the DMA engines and CPU cores into separate
NUMA nodes, both of which are initiators. I think this is probably fine as it
conveys useful info.
I don't think the HMAT has a concept of increasing bandwidth for number of CPU
cores used - it just has a single bandwidth number (well, one for read and one
for write) per initiator/target pair. I don't think we want to add this,
either - the HMAT is already very complex.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
2017-12-20 18:19 ` [PATCH v3 0/3] create sysfs representation of ACPI HMAT Matthew Wilcox
2017-12-20 20:22 ` Dave Hansen
2017-12-20 21:13 ` Ross Zwisler
@ 2017-12-21 12:50 ` Michael Ellerman
2 siblings, 0 replies; 18+ messages in thread
From: Michael Ellerman @ 2017-12-21 12:50 UTC (permalink / raw)
To: Matthew Wilcox, Ross Zwisler
Cc: Michal Hocko, Box, David E, Dave Hansen, Zheng, Lv, linux-nvdimm,
Verma, Vishal L, Rafael J. Wysocki, Anaczkowski, Lukasz,
Moore, Robert, linux-acpi, Odzioba, Lukasz, Schmauss, Erik,
Len Brown, John Hubbard, linuxppc-dev, Jerome Glisse,
Dan Williams, devel, Kogut, Jaroslaw, linux-mm, Koss, Marcin,
linux-api, Brice Goglin, Nachimuthu, Murugasamy,
Rafael J. Wysocki, linux-kernel, Koziej, Artur, Lahtinen, Joonas,
Andrew Morton, Tim Chen
Matthew Wilcox <willy@infradead.org> writes:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
>> What I'm hoping to do with this series is to just provide a sysfs
>> representation of the HMAT so that applications can know which NUMA nodes to
>> select with existing utilities like numactl. This series does not currently
>> alter any kernel behavior, it only provides a sysfs interface.
>>
>> Say for example you had a system with some high bandwidth memory (HBM), and
>> you wanted to use it for a specific application. You could use the sysfs
>> representation of the HMAT to figure out which memory target held your HBM.
>> You could do this by looking at the local bandwidth values for the various
>> memory targets, so:
>>
>> # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
>> /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
>> /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
>> /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
>> /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
>>
>> and look for the one that corresponds to your HBM speed. (These numbers are
>> made up, but you get the idea.)
>
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future. I think
> we need a platform-agnostic way ... right, PowerPC people?
Yes!
I don't have any detail at hand but will try and rustle something up.
cheers
^ permalink raw reply [flat|nested] 18+ messages in thread