From: Randy Dunlap <rdunlap@infradead.org>
To: "Jérôme Glisse" <jglisse@redhat.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	joro@8bytes.org, Mel Gorman <mgorman@suse.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <jweiner@redhat.com>,
	Larry Woodman <lwoodman@redhat.com>,
	Rik van Riel <riel@redhat.com>, Dave Airlie <airlied@redhat.com>,
	Brendan Conoboy <blc@redhat.com>,
	Joe Donohue <jdonohue@redhat.com>,
	Christophe Harle <charle@nvidia.com>,
	Duncan Poole <dpoole@nvidia.com>,
	Sherry Cheung <SCheung@nvidia.com>,
	Subhash Gutti <sgutti@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Mark Hairgrove <mhairgrove@nvidia.com>,
	Lucien Dunning <ldunning@nvidia.com>,
	Cameron Buschardt <cabuschardt@nvidia.com>,
	Arvind Gopalakrishnan <arvindg@nvidia.com>,
	Haggai Eran <haggaie@mellanox.com>,
	Shachar Raindel <raindel@mellanox.com>,
	Liran Liss <liranl@mellanox.com>,
	Roland Dreier <roland@purestorage.com>,
	Ben Sander <ben.sander@amd.com>,
	Greg Stoner <Greg.Stoner@amd.com>,
	John Bridgman <John.Bridgman@amd.com>,
	Michael Mantor <Michael.Mantor@amd.com>,
	Paul Blinzer <Paul.Blinzer@amd.com>,
	Leonid Shamis <Leonid.Shamis@amd.com>,
	Laurent Morichetti <Laurent.Morichetti@amd.com>,
	Alexander Deucher <Alexander.Deucher@amd.com>
Subject: Re: [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
Date: Wed, 21 Oct 2015 20:23:41 -0700
Message-ID: <562856BD.3020806@infradead.org>
In-Reply-To: <1445461210-2605-16-git-send-email-jglisse@redhat.com>

Hi,

Some corrections and a few questions...

On 10/21/15 14:00, Jérôme Glisse wrote:
> This add documentation on how HMM works and a more in depth view of how it
> should be use by device driver writers.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 219 insertions(+)
>  create mode 100644 Documentation/vm/hmm.txt
> 
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> new file mode 100644
> index 0000000..febed50
> --- /dev/null
> +++ b/Documentation/vm/hmm.txt
> @@ -0,0 +1,219 @@
> +Heterogeneous Memory Management (HMM)
> +-------------------------------------
> +
> +The raison d'être of HMM is to provide a common API for device driver that

                                                                    drivers

> +wants to mirror a process address space on there device and/or migrate system

   want                                       their

> +memory to device memory. Device driver can decide to only use one aspect of

                                   drivers

> +HMM (mirroring or memory migration), for instance some device can directly
> +access process address space through hardware (for instance PCIe ATS/PASID),
> +but still want to benefit from memory migration capabilities that HMM offer.
> +
> +While HMM rely on existing kernel infrastructure (namely mmu_notifier) some

             relies

> +of its features (memory migration, atomic access) require integration with
> +core mm kernel code. Having HMM as the common intermediary is more appealing

        MM

> +than having each device driver hooking itself inside the common mm code.

                                                                   MM

> +
> +Moreover HMM as a layer allows integration with DMA API or page reclaimation.

                                                                   reclamation.

> +
> +
> +Mirroring address space on the device:
> +--------------------------------------
> +
> +Device that can't directly access transparently the process address space, need
> +to mirror the CPU page table into there own page table. HMM helps to keep the

                                     their

> +device page table synchronize with the CPU page table. It is not expected that

                     synchronized

> +the device will fully mirror the CPU page table but only mirror region that are

                                                                   regions

> +actively accessed by the device. For that reasons HMM only helps populating and

                                             reason

> +synchronizing device page table for range that the device driver explicitly ask

                                       ranges                                  asks

or is only one range supported?


> +for.
> +
> +Mirroring address space inside the device page table is easy with HMM :

                                                                     HMM:

> +
> +  /* Create a mirror for the current process for your device. */
> +  your_hmm_mirror->hmm_mirror.device = your_hmm_device;
> +  hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
> +
> +  ...
> +
> +  /* Mirror memory (in read mode) between addressA and addressB */
> +  your_hmm_event->hmm_event.start = addressA;
> +  your_hmm_event->hmm_event.end = addressB;

Can multiple events (ranges) be specified?
Is hmm_event.end (addressB) included or excluded from the range?

> +  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
> +  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> +    /* HMM callback into your driver with the >update() callback. During the
> +     * callback use the HMM page table to populate the device page table. You
> +     * can only use the HMM page table to populate the device page table for
> +     * the specified range during the >update() callback, at any other point in
> +     * time the HMM page table content should be assume to be undefined.

                                                    assumed

> +     */
> +    your_hmm_device->update(mirror, event);
> +
> +  ...
> +
> +  /* Process is quiting or device done stop the mirroring and cleanup. */

                   quitting or device done; stop

> +  hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
> +  /* Device driver can free your_hmm_mirror */
> +
> +
> +HMM mirror page table:
> +----------------------
> +
> +Each hmm_mirror object is associated with a mirror page table that HMM keeps
> +synchronize with the CPU page table by using the mmu_notifier API. HMM is using

   synchronized

> +its own generic page table format because it needs to store DMA address, which

                                                                   addresses,

> +are bigger than long on some architecture, and have more flags per entry than

                                architectures,

> +radix tree allows.
> +
> +The HMM page table mostly mirror x86 page table layout. A page holds a global

                             mirrors

> +directory and each entry points to a lower level directory. Unlike regular CPU
> +page table, directory level are more aggressively freed and remove from the HMM

        tables,          levels                                removed

> +mirror page table. This means device driver needs to use the HMM helpers and to

                                        drivers need

> +follow directive on when and how to access the mirror page table. HMM use the

                                                                         uses

> +per page spinlock of directory page to synchronize update of directory ie update

                                  pages                         directory, i.e.,

> +can happen on different directory concurently.

                                     concurrently.

> +
> +As a rules the mirror page table can only be accessed by device driver from one

        rule                                             by a device driver

> +of the HMM device callback. Any access from outside a callback is illegal and

                     callbacks.

> +gives undertimed result.

         undetermined
or       undefined

> +
> +Accessing the mirror page table from a device callback needs to use the HMM
> +page table helpers. A loop to access entry for a range of address looks like :

                                        entries              addresses looks like:

> +
> +  /* Initialize a HMM page table iterator. */

                   an HMM

> +  struct hmm_pt_iter iter;
> +  hmm_pt_iter_init(&iter, &mirror->pt)
> +
> +  /* Get pointer to HMM page table entry for a given address. */
> +  dma_addr_t *hmm_pte;
> +  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);

What are the types of 'addr' and 'next'?

> +
> +If there is no valid entry directory for given range address then hmm_pte is
> +NULL. If there is a valid entry directory then you can access the hmm_pte and
> +the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
> +the same iter struct for a different address or call hmm_pt_iter_fini().
> +
> +While the HMM page table entry pointer stays valid you can only modify the
> +value it is pointing to by using one of HMM helpers (hmm_pte_*()) as other
> +threads might be updating the same entry concurrently. The device driver only
> +need to update an HMM page table entry to set the dirty bit, so driver should

   needs                                                           drivers

> +only be using hmm_pte_set_dirty().
> +
> +Similarly to extract information the device driver should use one of the helper

                                                                            helpers

> +like hmm_pte_dma_addr() or hmm_pte_pfn() (if HMM is not doing DMA mapping which
> +is a device driver at initialization parameter).
> +
> +
> +Migrating system memory to device memory:
> +-----------------------------------------
> +
> +Device like discret GPU often have there own local memory which offer bigger

   Devices     discrete GPUs          their

> +bandwidth and smaller latency than access to system memory for the GPU. This
> +local memory is not necessarily accessible by the CPU. Device local memory will
> +remain revealent for the foreseeable future as bandwidth of GPU memory keep

          relevant                                                        keeps

> +increasing faster than bandwidth of system memory and as latency of PCIe does
> +not decrease.
> +
> +Thus to maximize use of device like GPU, program need to use the device memory.

                           devices like GPUs, programs

> +Userspace API wants to make this as transparent as it can be, so that there is
> +no need for complex modification of applications.
> +
> +Transparent use of device memory for range of address of a process require core

                                                                      requires

> +mm code modifications. Adding a new memory zone for devices memory did not make

   MM                                                  device

> +sense given that such memory is often only accessible by the device only. This
> +is why we decided to use a special kind of swap, migrated memory is mark as a

                                              swap;                    marked

> +special swap entry inside the CPU page table.
> +
> +While HMM handles the migration process, it does not decide what range or when
> +to migrate memory. The decision to perform such migration is under the control
> +of the device driver. Migration back to system memory happens either because
> +the CPU try to access the memory or because device driver decided to migrate

           tries

> +the memory back.
> +
> +
> +  /* Migrate system memory between addressA and addressB to device memory. */
> +  your_hmm_event->hmm_event.start = addressA;
> +  your_hmm_event->hmm_event.end = addressB;

is hmm_event.end (addressB) inclusive or exclusive?
i.e., is it end_of_copy + 1?
i.e., is the size of the copy addressB - addressA or
      addressB - addressA + 1?
i.e., is addressB = addressA + size
or is    addressB = addressA + size - 1

In my experience it is usually better to have a start_address and size
instead of start_address and end_address.
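
To make the question concrete, here is a minimal sketch in plain C of the two
conventions. The struct and function names are hypothetical (nothing here is
taken from hmm.h); it only illustrates what each choice would imply for the
size computation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range descriptors; neither is taken from hmm.h. */
struct range_excl {
	uintptr_t start;
	uintptr_t end;		/* first address NOT in the range: [start, end) */
};

struct range_incl {
	uintptr_t start;
	uintptr_t last;		/* last address IN the range: [start, last] */
};

/* Exclusive end: size is a plain subtraction. */
static inline size_t range_excl_size(const struct range_excl *r)
{
	return (size_t)(r->end - r->start);
}

/* Inclusive end: size needs the extra + 1. */
static inline size_t range_incl_size(const struct range_incl *r)
{
	return (size_t)(r->last - r->start) + 1;
}

/* Build an exclusive-end range from a start address and a size. */
static inline struct range_excl range_from_size(uintptr_t start, size_t size)
{
	struct range_excl r;

	assert(size <= UINTPTR_MAX - start);	/* reject address wrap-around */
	r.start = start;
	r.end = start + size;
	return r;
}
```

With the exclusive form, start + size gives the end directly and the size is
end - start, with no off-by-one adjustment anywhere. Whichever form HMM uses,
the documentation should say so explicitly.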

> +  your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
> +  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> +    /* HMM callback into your driver with the >copy_to_device() callback.
> +     * Device driver must allocate device memory, DMA system memory to device
> +     * memory, update the device page table to point to device memory and
> +     * return. See hmm.h for details instructions and how failure are handled.

                                detailed                     failures

> +     */
> +    your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
> +
> +
> +Right now HMM only support migrating anonymous private memory. Migration of

                      supports

> +share memory and more generaly file mapped memory is on the road map.

   shared                generally

> +
> +
> +Locking consideration and overall design:
> +-----------------------------------------
> +
> +As a rule HMM will handle proper locking on the behalf of the device driver,
> +as such device driver does not need to take any mm lock before calling into

                                                   MM

> +the HMM code.
> +
> +HMM is also responsible of the hmm_device and hmm_mirror object lifetime. The

                           for

> +device driver can only free those after calling hmm_device_unregister() or
> +hmm_mirror_unregister() respectively.
> +
> +All the lock inside any of the HMM structure should never be use by the device

           locks                      structures

> +driver. They are intended to be use only and only by HMM code. Below is short

                                   used only by the HMM code.

> +description of the 3 main locks that exist for HMM internal use. Educational
> +purpose only.
> +
> +Each process mm has one and only one struct hmm associated with it. Each hmm

                MM

> +struct can be use by several different mirror. There is one and only one mirror

                                          mirrors.

> +per mm and device pair. So in essence the hmm struct is the core that dispatch

       MM                                                                dispatches

> +everything to every single mirror, each of them corresponding to a specific
> +device. The list of mirror for an hmm struct is protected by a semaphore as it

                       mirrors
> +sees mostly read access.
> +
> +Each time a device fault a range of address it calls hmm_mirror_fault(), HMM

                      faults

> +keeps track, inside the hmm struct, of each range currently being faulted. It
> +does that so it can synchronize with any CPU page table update. If there is a
> +CPU page table update then a callback through mmu_notifier will happen and HMM
> +will try to interrupt the device page fault that conflict (ie address range

                                                    conflicts (i.e.,

> +overlap with the range being updated) and wait for them to back off. This
> +insure that at no point in time the device driver see transient page table

   ensures                                           sees

> +information. The list of active fault is protected by a spinlock, query on

                                   faults                  spinlock;

> +that list should be short and quick (we haven't gather enough statistic on

                                                   gathered      statistics

> +that side yet to have a good idea of the average access pattern).
> +
> +Each device driver wanting to use HMM must register one and only one hmm_device
> +struct per physical device with HMM. The hmm_device struct have pointer to the

                                                              has

> +device driver call back and keeps track of active mirrors for a given device.

                 callback

> +The active mirrors list is protected by a spinlock.
> +
> +
> +Future work:
> +------------
> +
> +Improved atomic access by the device to system memory. Some platform bus (PCIe)

                                                                        busses

> +offer limited number of atomic memory operations, some platform do not even

                                         operations;      platforms

> +have any kind of atomic memory operations by a device. In order to allow such
> +atomic operation we want to map page read only the CPU while the device perform

          operations               pages read-only in the CPU              performs

> +its operation. For this we need a new case inside the CPU write fault code path
> +to synchronize with the device.
> +
> +We want to allow program to lock a range of memory inside device memory and

              allow a program

> +forbid CPU access while the memory is lock inside the device. Any CPU access

                                         locked

> +to locked range would result in SIGBUS. We think that madvise() would be the
> +right syscall into which we could plug that feature.
> +
> +In order to minimize kernel memory consumption and overhead of DMA mapping, we
> +want to introduce new DMA API that allows to manage mapping on IOMMU directory
> +page basis. This would allow to map/unmap/update DMA mapping in bulk and
> +minimize IOMMU update and flushing overhead. Moreover this would allow to
> +improve IOMMU bad access reporting for DMA address inside those directory.
> +
> +Because update to the device page table might require "heavy" synchronization
> +with the device, the mmu_notifier callback might have to sleep while HMM is
> +waiting for the device driver to report device page table update completion.
> +This is especialy bad if this happens during page reclaimation, this might

           especially                                reclamation;

> +bring the system to pause. We want to mitigate this, either by maintaining a
> +new intermediate lru level in which we put pages actively mirrored by a device

                    LRU

> +or by some other mecanism. For time being we advice that device driver that

                    mechanism.                  advise

> +use HMM explicitly explain this corner case so that user are aware that this

                                                       users

> +can happens if there is memory pressure.

       happen
> 


-- 
~Randy


WARNING: multiple messages have this Message-ID (diff)
From: Randy Dunlap <rdunlap@infradead.org>
To: "Jérôme Glisse" <jglisse@redhat.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	joro@8bytes.org, Mel Gorman <mgorman@suse.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <jweiner@redhat.com>,
	Larry Woodman <lwoodman@redhat.com>,
	Rik van Riel <riel@redhat.com>, Dave Airlie <airlied@redhat.com>,
	Brendan Conoboy <blc@redhat.com>,
	Joe Donohue <jdonohue@redhat.com>,
	Christophe Harle <charle@nvidia.com>,
	Duncan Poole <dpoole@nvidia.com>,
	Sherry Cheung <SCheung@nvidia.com>,
	Subhash Gutti <sgutti@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Mark Hairgrove <mhairgrove@nvidia.com>,
	Lucien Dunning <ldunning@nvidia.com>,
	Cameron Buschardt <cabuschardt@nvidia.com>,
	Arvind Gopalakrishnan <arvindg@nvidia.com>,
	Haggai Eran <haggaie@mellanox.com>,
	Shachar Raindel <raindel@mellanox.com>,
	Liran Liss <liranl@mellanox.com>,
	Roland Dreier <roland@purestorage.com>,
	Ben Sander <ben.sander@amd.com>,
	Greg Stoner <Greg.Stoner@amd.com>,
	John Bridgman <John.Bridgman@amd.com>,
	Michael Mantor <Michael.Mantor@amd.com>,
	Paul Blinzer <Paul.Blinzer@amd.com>,
	Leonid Shamis <Leonid.Shamis@amd.com>,
	Laurent Morichetti <Laurent.Morichetti@amd.com>,
	Alexander Deucher <Alexander.Deucher@amd.com>
Subject: Re: [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
Date: Wed, 21 Oct 2015 20:23:41 -0700	[thread overview]
Message-ID: <562856BD.3020806@infradead.org> (raw)
In-Reply-To: <1445461210-2605-16-git-send-email-jglisse@redhat.com>

Hi,

Some corrections and a few questions...

On 10/21/15 14:00, Jérôme Glisse wrote:
> This add documentation on how HMM works and a more in depth view of how it
> should be use by device driver writers.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 219 insertions(+)
>  create mode 100644 Documentation/vm/hmm.txt
> 
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> new file mode 100644
> index 0000000..febed50
> --- /dev/null
> +++ b/Documentation/vm/hmm.txt
> @@ -0,0 +1,219 @@
> +Heterogeneous Memory Management (HMM)
> +-------------------------------------
> +
> +The raison d'�tre of HMM is to provide a common API for device driver that

                                                                    drivers

> +wants to mirror a process address space on there device and/or migrate system

   want                                       their

> +memory to device memory. Device driver can decide to only use one aspect of

                                   drivers

> +HMM (mirroring or memory migration), for instance some device can directly
> +access process address space through hardware (for instance PCIe ATS/PASID),
> +but still want to benefit from memory migration capabilities that HMM offer.
> +
> +While HMM rely on existing kernel infrastructure (namely mmu_notifier) some

             relies

> +of its features (memory migration, atomic access) require integration with
> +core mm kernel code. Having HMM as the common intermediary is more appealing

        MM

> +than having each device driver hooking itself inside the common mm code.

                                                                   MM

> +
> +Moreover HMM as a layer allows integration with DMA API or page reclaimation.

                                                                   reclamation.

> +
> +
> +Mirroring address space on the device:
> +--------------------------------------
> +
> +Device that can't directly access transparently the process address space, need
> +to mirror the CPU page table into there own page table. HMM helps to keep the

                                     their

> +device page table synchronize with the CPU page table. It is not expected that

                     synchronized

> +the device will fully mirror the CPU page table but only mirror region that are

                                                                   regions

> +actively accessed by the device. For that reasons HMM only helps populating and

                                             reason

> +synchronizing device page table for range that the device driver explicitly ask

                                       ranges                                  asks

or is only one range supported?


> +for.
> +
> +Mirroring address space inside the device page table is easy with HMM :

                                                                     HMM:

> +
> +  /* Create a mirror for the current process for your device. */
> +  your_hmm_mirror->hmm_mirror.device = your_hmm_device;
> +  hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
> +
> +  ...
> +
> +  /* Mirror memory (in read mode) between addressA and addressB */
> +  your_hmm_event->hmm_event.start = addressA;
> +  your_hmm_event->hmm_event.end = addressB;

Multiple events (ranges) can be specified?
Is hmm_event.end (addressB) included or excluded from the range?

> +  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
> +  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> +    /* HMM callback into your driver with the >update() callback. During the
> +     * callback use the HMM page table to populate the device page table. You
> +     * can only use the HMM page table to populate the device page table for
> +     * the specified range during the >update() callback, at any other point in
> +     * time the HMM page table content should be assume to be undefined.

                                                    assumed

> +     */
> +    your_hmm_device->update(mirror, event);
> +
> +  ...
> +
> +  /* Process is quiting or device done stop the mirroring and cleanup. */

                   quitting or device done; stop

> +  hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
> +  /* Device driver can free your_hmm_mirror */
> +
> +
> +HMM mirror page table:
> +----------------------
> +
> +Each hmm_mirror object is associated with a mirror page table that HMM keeps
> +synchronize with the CPU page table by using the mmu_notifier API. HMM is using

   synchronized

> +its own generic page table format because it needs to store DMA address, which

                                                                   adresses,

> +are bigger than long on some architecture, and have more flags per entry than

                                architectures,

> +radix tree allows.
> +
> +The HMM page table mostly mirror x86 page table layout. A page holds a global

                             mirrors

> +directory and each entry points to a lower level directory. Unlike regular CPU
> +page table, directory level are more aggressively freed and remove from the HMM

        tables,          levels                                removed

> +mirror page table. This means device driver needs to use the HMM helpers and to

                                        drivers need

> +follow directive on when and how to access the mirror page table. HMM use the

                                                                         uses

> +per page spinlock of directory page to synchronize update of directory ie update

                                  pages                         directory, i.e.,

> +can happen on different directory concurently.

                                     concurrently.

> +
> +As a rules the mirror page table can only be accessed by device driver from one

        rule                                             by a device driver

> +of the HMM device callback. Any access from outside a callback is illegal and

                     callbacks.

> +gives undertimed result.

         undetermined
or       undefined

> +
> +Accessing the mirror page table from a device callback needs to use the HMM
> +page table helpers. A loop to access entry for a range of address looks like :

                                        entries              addresses looks like:

> +
> +  /* Initialize a HMM page table iterator. */

                   an HMM

> +  struct hmm_pt_iter iter;
> +  hmm_pt_iter_init(&iter, &mirror->pt)
> +
> +  /* Get pointer to HMM page table entry for a given address. */
> +  dma_addr_t *hmm_pte;
> +  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);

what are 'addr' and 'next'? (types)

> +
> +If there is no valid entry directory for given range address then hmm_pte is
> +NULL. If there is a valid entry directory then you can access the hmm_pte and
> +the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
> +the same iter struct for a different address or call hmm_pt_iter_fini().
> +
> +While the HMM page table entry pointer stays valid you can only modify the
> +value it is pointing to by using one of HMM helpers (hmm_pte_*()) as other
> +threads might be updating the same entry concurrently. The device driver only
> +need to update an HMM page table entry to set the dirty bit, so driver should

   needs                                                           drivers

> +only be using hmm_pte_set_dirty().
> +
> +Similarly to extract information the device driver should use one of the helper

                                                                            helpers

> +like hmm_pte_dma_addr() or hmm_pte_pfn() (if HMM is not doing DMA mapping which
> +is a device driver at initialization parameter).
> +
> +
> +Migrating system memory to device memory:
> +-----------------------------------------
> +
> +Device like discret GPU often have there own local memory which offer bigger

   Devices     discrete GPUs          their

> +bandwidth and smaller latency than access to system memory for the GPU. This
> +local memory is not necessarily accessible by the CPU. Device local memory will
> +remain revealent for the foreseeable future as bandwidth of GPU memory keep

          relevant                                                        keeps

> +increasing faster than bandwidth of system memory and as latency of PCIe does
> +not decrease.
> +
> +Thus to maximize use of device like GPU, program need to use the device memory.

                           devices like GPUs, programs

> +Userspace API wants to make this as transparent as it can be, so that there is
> +no need for complex modification of applications.
> +
> +Transparent use of device memory for range of address of a process require core

                                                                      requires

> +mm code modifications. Adding a new memory zone for devices memory did not make

   MM                                                  device

> +sense given that such memory is often only accessible by the device only. This
> +is why we decided to use a special kind of swap, migrated memory is mark as a

                                              swap;                    marked

> +special swap entry inside the CPU page table.
> +
> +While HMM handles the migration process, it does not decide what range or when
> +to migrate memory. The decision to perform such migration is under the control
> +of the device driver. Migration back to system memory happens either because
> +the CPU try to access the memory or because device driver decided to migrate

           tries

> +the memory back.
> +
> +
> +  /* Migrate system memory between addressA and addressB to device memory. */
> +  your_hmm_event->hmm_event.start = addressA;
> +  your_hmm_event->hmm_event.end = addressB;

is hmm_event.end (addressB) inclusive and exclusive?
i.e., is it end_of_copy + 1?
i.e., is the size of the copy addressB - addressA or
      addressB - addressA + 1?
i.e., is addressB = addressA + size
or is    addressB = addressA + size - 1

In my experience it is usually better to have a start_address and size
instead of start_address and end_address.

> +  your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
> +  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> +    /* HMM callback into your driver with the >copy_to_device() callback.
> +     * Device driver must allocate device memory, DMA system memory to device
> +     * memory, update the device page table to point to device memory and
> +     * return. See hmm.h for details instructions and how failure are handled.

                                detailed                     failures

> +     */
> +    your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
> +
> +
> +Right now HMM only support migrating anonymous private memory. Migration of

                      supports

> +share memory and more generaly file mapped memory is on the road map.

   shared                generally

> +
> +
> +Locking consideration and overall design:
> +-----------------------------------------
> +
> +As a rule HMM will handle proper locking on the behalf of the device driver,
> +as such device driver does not need to take any mm lock before calling into

                                                   MM

> +the HMM code.
> +
> +HMM is also responsible of the hmm_device and hmm_mirror object lifetime. The

                           for

> +device driver can only free those after calling hmm_device_unregister() or
> +hmm_mirror_unregister() respectively.
> +
> +All the lock inside any of the HMM structure should never be use by the device

           locks                      structures

> +driver. They are intended to be use only and only by HMM code. Below is short

                                   used only by the HMM code.

> +description of the 3 main locks that exist for HMM internal use. Educational
> +purpose only.
> +
> +Each process mm has one and only one struct hmm associated with it. Each hmm

                MM

> +struct can be use by several different mirror. There is one and only one mirror

                                          mirrors.

> +per mm and device pair. So in essence the hmm struct is the core that dispatch

       MM                                                                dispatches

> +everything to every single mirror, each of them corresponding to a specific
> +device. The list of mirror for an hmm struct is protected by a semaphore as it

                       mirrors
> +sees mostly read access.
> +
> +Each time a device fault a range of address it calls hmm_mirror_fault(), HMM

                      faults

> +keeps track, inside the hmm struct, of each range currently being faulted. It
> +does that so it can synchronize with any CPU page table update. If there is a
> +CPU page table update then a callback through mmu_notifier will happen and HMM
> +will try to interrupt the device page fault that conflict (ie address range

                                                    conflicts (i.e.,

> +overlap with the range being updated) and wait for them to back off. This
> +insure that at no point in time the device driver see transient page table

   ensures                                           sees

> +information. The list of active fault is protected by a spinlock, query on

                                   faults                  spinlock;

> +that list should be short and quick (we haven't gather enough statistic on

                                                   gathered      statistics

> +that side yet to have a good idea of the average access pattern).
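
If I read the paragraph above correctly, the conflict test is an overlap check
between the range being faulted and the range being updated. A minimal sketch
of such a check, assuming half-open [start, end) ranges (see my question about
hmm_event.end); ranges_overlap() is a hypothetical helper, not something from
the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper for illustration: return true if the half-open
 * ranges [a_start, a_end) and [b_start, b_end) share at least one address.
 */
static inline bool ranges_overlap(unsigned long a_start, unsigned long a_end,
				  unsigned long b_start, unsigned long b_end)
{
	/* Precondition: both ranges are well formed. */
	assert(a_start <= a_end && b_start <= b_end);

	/* Each range must begin before the other one ends. */
	return a_start < b_end && b_start < a_end;
}
```

Note that with an exclusive end, two adjacent ranges do not conflict. With an
inclusive end, both comparisons would have to be <= instead.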
> +
> +Each device driver wanting to use HMM must register one and only one hmm_device
> +struct per physical device with HMM. The hmm_device struct have pointer to the

                                                              has

> +device driver call back and keeps track of active mirrors for a given device.

                 callback

> +The active mirrors list is protected by a spinlock.
> +
> +
> +Future work:
> +------------
> +
> +Improved atomic access by the device to system memory. Some platform bus (PCIe)

                                                                        busses

> +offer limited number of atomic memory operations, some platform do not even

                                         operations;      platforms

> +have any kind of atomic memory operations by a device. In order to allow such
> +atomic operation we want to map page read only the CPU while the device perform

          operations               pages read-only in the CPU              performs

> +its operation. For this we need a new case inside the CPU write fault code path
> +to synchronize with the device.
> +
> +We want to allow program to lock a range of memory inside device memory and

              allow a program

> +forbid CPU access while the memory is lock inside the device. Any CPU access

                                         locked

> +to locked range would result in SIGBUS. We think that madvise() would be the
> +right syscall into which we could plug that feature.
> +
> +In order to minimize kernel memory consumption and overhead of DMA mapping, we
> +want to introduce new DMA API that allows to manage mapping on IOMMU directory
> +page basis. This would allow to map/unmap/update DMA mapping in bulk and
> +minimize IOMMU update and flushing overhead. Moreover this would allow to
> +improve IOMMU bad access reporting for DMA address inside those directory.
> +
> +Because update to the device page table might require "heavy" synchronization
> +with the device, the mmu_notifier callback might have to sleep while HMM is
> +waiting for the device driver to report device page table update completion.
> +This is especialy bad if this happens during page reclaimation, this might

           especially                                reclamation;

> +bring the system to pause. We want to mitigate this, either by maintaining a
> +new intermediate lru level in which we put pages actively mirrored by a device

                    LRU

> +or by some other mecanism. For time being we advice that device driver that

                    mechanism.                  advise

> +use HMM explicitly explain this corner case so that user are aware that this

                                                       users

> +can happens if there is memory pressure.

       happen
> 


-- 
~Randy
