linux-mm.kvack.org archive mirror
From: Jerome Glisse <jglisse@redhat.com>
To: Bob Liu <liubo95@huawei.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, John Hubbard <jhubbard@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	David Nellans <dnellans@nvidia.com>,
	Balbir Singh <bsingharora@gmail.com>,
	majiuyue <majiuyue@huawei.com>,
	"xieyisheng (A)" <xieyisheng1@huawei.com>
Subject: Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
Date: Thu, 7 Sep 2017 13:00:25 -0400 (EDT)
Message-ID: <623597181.10447378.1504803625592.JavaMail.zimbra@redhat.com>
In-Reply-To: <4f4a2196-228d-5d54-5386-72c3ffb1481b@huawei.com>

> On 2017/9/6 10:12, Jerome Glisse wrote:
> > On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
> >> On 2017/9/6 2:54, Ross Zwisler wrote:
> >>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> >>>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >>>>> On 2017/9/4 23:51, Jerome Glisse wrote:
> >>>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >>>>>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
> >>>>>>>> Unlike unaddressable memory, coherent device memory has a real
> >>>>>>>> resource associated with it on the system (as the CPU can
> >>>>>>>> address it). Add a new helper to hotplug such memory within
> >>>>>>>> the HMM framework.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Got a new question: coherent device (e.g. CCIX) memory is likely
> >>>>>>> to be reported to the OS through ACPI and recognized as a NUMA
> >>>>>>> memory node. How can such memory then be captured and managed by
> >>>>>>> the HMM framework?
> >>>>>>>
> >>>>>>
> >>>>>> The only platform that has such memory today is powerpc, and there
> >>>>>> it is not reported as regular memory by the firmware, which is why
> >>>>>> this helper is needed.
> >>>>>>
> >>>>>> I don't think anyone has defined anything yet for x86 and ACPI.
> >>>>>> As this is
> >>>>>
> >>>>> Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
> >>>>> Table (HMAT) defined in ACPI 6.2. The HMAT can cover CPU-addressable
> >>>>> memory types (though not non-cache-coherent on-device memory).
> >>>>>
> >>>>> Ross from Intel has already done some work on this, see:
> >>>>> https://lwn.net/Articles/724562/
> >>>>>
> >>>>> arm64 supports ACPI as well, and there will likely be more devices of
> >>>>> this kind once CCIX is out (which should be very soon if it stays on
> >>>>> schedule).
> >>>>
> >>>> HMAT is not for the same thing. AFAIK HMAT is for deep memory
> >>>> "hierarchy", ie when you have several kinds of memory, each with
> >>>> different characteristics:
> >>>>   - HBM: very fast (latency) and high bandwidth, non persistent,
> >>>>     somewhat small (ie a few gigabytes)
> >>>>   - Persistent memory: slower (both latency and bandwidth), big
> >>>>     (terabytes)
> >>>>   - DDR (good old memory): characteristics between HBM and persistent
> >>>>     memory
> >>>>
> >>>> So AFAICT this has nothing to do with what HMM is for, ie device
> >>>> memory. Note that device memory can have a memory hierarchy of its own
> >>>> (HBM, GDDR, and maybe even persistent memory).
> >>>>
> >>>>>> memory on a PCIE-like interface, I don't expect it to be reported as
> >>>>>> a NUMA memory node but as an io range, like any regular PCIE
> >>>>>> resource. The device driver would then figure out, through capability
> >>>>>> flags, whether the link between the device and the CPU is CCIX
> >>>>>> capable; if so, it can use this helper to hotplug the memory as
> >>>>>> device memory.
> >>>>>>
> >>>>>
> >>>>> From my point of view, cache-coherent device memory will be popular
> >>>>> soon and will be reported through ACPI/UEFI. Extending the NUMA
> >>>>> policy still sounds more reasonable to me.
> >>>>
> >>>> Cache coherent devices will be reported through the standard mechanisms
> >>>> defined by the bus standard they use. To my knowledge all of those
> >>>> standards are either on top of PCIE or similar to PCIE.
> >>>>
> >>>> It is true that on many platforms the PCIE resources are managed and
> >>>> initialized by the BIOS (UEFI), but that is platform specific. In some
> >>>> cases we reprogram what the BIOS picked.
> >>>>
> >>>> So, as I was saying, I don't expect the BIOS/UEFI to report device
> >>>> memory as regular memory. It will be reported as a regular PCIE
> >>>> resource, and the device driver will then be able to determine, through
> >>>> some flags, whether the link between the CPU(s) and the device is cache
> >>>> coherent or not. At that point the device driver can register it with
> >>>> the HMM helper.
> >>>>
> >>>>
> >>>> The whole NUMA discussion has happened several times in the past; I
> >>>> suggest looking in the mm list archives for it. It was ruled out for
> >>>> several reasons. Off the top of my head:
> >>>>   - people hate CPU-less nodes, and device memory is inherently CPU-less
> >>>
> >>> With the introduction of the HMAT in ACPI 6.2, one of the things that
> >>> was added was the ability to have an ACPI proximity domain that isn't
> >>> associated with a CPU.  This can be seen in the changes to the text of
> >>> the "Proximity Domain" field in table 5-73, which describes the "Memory
> >>> Affinity Structure".  One of the major features of the HMAT was the
> >>> separation of "initiator" proximity domains (CPUs and devices that
> >>> initiate memory transfers) from "target" proximity domains (memory
> >>> regions, be they attached to a CPU or some other device).
> >>>
> >>> ACPI proximity domains map directly to Linux NUMA nodes, so I think
> >>> we're already in a place where we have to support CPU-less NUMA nodes.
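Just to illustrate what that means on the kernel side, here is a small sketch
(illustration only) that walks the online nodes and reports the memory-only
ones, i.e. what a target-only proximity domain ends up as:

#include <linux/nodemask.h>
#include <linux/printk.h>

static void report_cpuless_nodes(void)
{
        int nid;

        for_each_online_node(nid) {
                /* A node with memory but no CPUs, e.g. a target-only domain. */
                if (node_state(nid, N_MEMORY) && !node_state(nid, N_CPU))
                        pr_info("node %d: memory-only (CPU-less) node\n", nid);
        }
}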
> >>>
> >>>>   - device drivers want total control over memory and thus to be
> >>>>     isolated from mm mechanisms, and handling all those special cases
> >>>>     was not welcome
> >>>
> >>> I agree that the kernel doesn't have enough information to be able to
> >>> accurately handle all the use cases for the various types of
> >>> heterogeneous memory.  The goal of my HMAT enabling is to allow that
> >>> memory to be reserved from kernel use via the "Reservation Hint" in the
> >>> HMAT's Memory Subsystem Address Range Structure, then provide userspace
> >>> with enough information to be able to distinguish between the various
> >>> types of memory in the system so it can allocate & utilize it
> >>> appropriately.
> >>>
> >>
> >> Does this mean a userspace memory management library is required to
> >> deal with all the alloc/free/defragment work?  But how would the
> >> virtual <-> physical address mapping be handled from userspace?
> > 
> > For HMM, each process gives hints (somewhat similar to mbind) for ranges
> > of virtual addresses to the device kernel driver (through some API like
> > OpenCL or CUDA for GPUs, for instance). All of this goes through
> > device-driver-specific ioctls.
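As an illustration of what such a per-range hint looks like from the
application side, here is a sketch using CUDA managed memory (one example of
the kind of API mentioned above; the buffer and sizes are made up):

#include <stddef.h>
#include <cuda_runtime.h>

void hint_range(float *buf, size_t bytes, int gpu)
{
        /* buf is assumed to come from cudaMallocManaged(&buf, bytes). */

        /* "prefer to keep this range resident on the GPU" */
        cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, gpu);

        /* "the CPU will also access it, keep a mapping for it there" */
        cudaMemAdvise(buf, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
}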
> > 
> > The kernel device driver has an overall view of all the processes that
> > use the device and of the memory advice each of them gave. From that
> > information the kernel device driver decides which parts of each process
> > address space to migrate to device memory.
> 
> Oh, I mean CDM-HMM.  I'm fine with HMM.
> 
> > This is obviously dynamic and likely to change over the process lifetime.
> > 
> > 
> > My understanding is that HMAT wants a similar API to allow processes to
> > give direction on where each range of virtual addresses should be
> > allocated. It is expected that most
> 
> Right, but it's not clear who should manage the physical memory allocation
> and set up the page table mapping. A new driver or the kernel?
> 
> > software can easily infer which parts of its address space will need
> > more bandwidth or lower latency versus which parts are sparsely
> > accessed ...
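For comparison, the closest existing userspace interface is mbind(2); here is
a rough sketch of a process pinning one range of its address space to a
particular node (the helper name is invented, and a real HMAT-aware API would
carry richer bandwidth/latency hints than a plain node binding):

#include <stddef.h>
#include <numaif.h>

/* addr must be page aligned; link with -lnuma for the mbind() wrapper. */
static long bind_range_to_node(void *addr, size_t len, int node)
{
        unsigned long nodemask = 1UL << node;

        return mbind(addr, len, MPOL_BIND, &nodemask,
                     sizeof(nodemask) * 8, 0);
}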
> > 
> > For HMAT I think the first targets are HBM and persistent memory; device
> > memory might be added later if that makes sense.
> > 
> 
> Okay, so there are two potential ways to handle CPU-addressable
> cache-coherent device memory (or CPU-less NUMA memory, or "target domain"
> memory in the ACPI spec)?
> 1. CDM-HMM
> 2. HMAT
> 
> --
> Regards,
> Bob Liu
> 
> 


Thread overview: 66+ messages
2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 01/19] hmm: heterogeneous memory management documentation v3 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 02/19] mm/hmm: heterogeneous memory management (HMM for short) v5 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 03/19] mm/hmm/mirror: mirror process address space on device with HMM helpers v3 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 04/19] mm/hmm/mirror: helper to snapshot CPU page table v4 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 05/19] mm/hmm/mirror: device page fault handler Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 06/19] mm/memory_hotplug: introduce add_pages Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5 Jérôme Glisse
2018-12-20  8:33   ` Dan Williams
2018-12-20 16:15     ` Jerome Glisse
2018-12-20 16:15       ` Jerome Glisse
2018-12-20 16:47       ` Dan Williams
2018-12-20 16:47         ` Dan Williams
2018-12-20 16:57         ` Jerome Glisse
2018-12-20 16:57           ` Jerome Glisse
2017-08-17  0:05 ` [HMM-v25 08/19] mm/ZONE_DEVICE: special case put_page() for device private pages v4 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 09/19] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 10/19] mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4 Jérôme Glisse
2017-09-05 17:13   ` Laurent Dufour
2017-09-05 17:21     ` Jerome Glisse
2017-08-17  0:05 ` [HMM-v25 11/19] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v7 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 12/19] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 13/19] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
2017-08-17 21:12   ` Andrew Morton
2017-08-17 21:44     ` Jerome Glisse
2017-08-17  0:05 ` [HMM-v25 14/19] mm/migrate: new memory migration helper for use with device memory v5 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 15/19] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 16/19] mm/migrate: support un-addressable ZONE_DEVICE page in migration v3 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 17/19] mm/migrate: allow migrate_vma() to alloc new page on empty entry v4 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 18/19] mm/device-public-memory: device memory cache coherent with CPU v5 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3 Jérôme Glisse
2017-09-04  3:09   ` Bob Liu
2017-09-04 15:51     ` Jerome Glisse
2017-09-05  1:13       ` Bob Liu
2017-09-05  2:38         ` Jerome Glisse
2017-09-05  3:50           ` Bob Liu
2017-09-05 13:50             ` Jerome Glisse
2017-09-05 16:18               ` Dan Williams
2017-09-05 19:00               ` Ross Zwisler
2017-09-05 19:20                 ` Jerome Glisse
2017-09-08 19:43                   ` Ross Zwisler
2017-09-08 20:29                     ` Jerome Glisse
2017-09-05 18:54           ` Ross Zwisler
2017-09-06  1:25             ` Bob Liu
2017-09-06  2:12               ` Jerome Glisse
2017-09-07  2:06                 ` Bob Liu
2017-09-07 17:00                   ` Jerome Glisse [this message]
2017-09-07 17:27                   ` Jerome Glisse
2017-09-08  1:59                     ` Bob Liu
2017-09-08 20:43                       ` Dan Williams
2017-11-17  3:47                         ` chetan L
2017-09-05  3:36       ` Balbir Singh
2017-08-17 21:39 ` [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Andrew Morton
2017-08-17 21:55   ` Jerome Glisse
2017-08-17 21:59     ` Dan Williams
2017-08-17 22:02       ` Jerome Glisse
2017-08-17 22:06         ` Dan Williams
2017-08-17 22:16       ` Andrew Morton
2017-12-13 12:10 ` Figo.zhang
2017-12-13 16:12   ` Jerome Glisse
2017-12-14  2:48     ` Figo.zhang
2017-12-14  3:16       ` Jerome Glisse
2017-12-14  3:53         ` Figo.zhang
2017-12-14  4:16           ` Jerome Glisse
2017-12-14  7:05             ` Figo.zhang
2017-12-14 15:28               ` Jerome Glisse
