From: j.glisse@gmail.com
Subject: [PATCH 06/11] hmm: heterogeneous memory management
Date: Fri, 2 May 2014 09:52:05 -0400
Message-ID: <1399038730-25641-7-git-send-email-j.glisse@gmail.com>
In-Reply-To: <1399038730-25641-1-git-send-email-j.glisse@gmail.com>
References: <1399038730-25641-1-git-send-email-j.glisse@gmail.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti, Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse

Motivation:

Heterogeneous memory management is intended to allow a device to
transparently access a process address space without having to lock pages
of the process or take references on them. In other words, it mirrors a
process address space while allowing regular memory management events,
such as page reclamation or page migration, to happen seamlessly.

Recent years have seen a surge in the number of specialized devices that
are part of a computer platform (from desktop to phone). So far each of
these devices has operated on its own private address space that is
neither linked to nor exposed in the address space of the process using
it. This separation often leads to multiple memory copies between the
device-owned memory and the process memory, which wastes both CPU cycles
and memory.

Over the last few years most of these devices have gained a full MMU,
allowing them to support multiple page tables, page faults and other
features found in CPU MMUs. There is now a strong incentive to start
leveraging the capabilities of such devices and to share the process
address space, avoiding unnecessary memory copies and simplifying the
programming model of these devices by giving them a single, common
address space with the process that uses them.

The aim of heterogeneous memory management is to provide a common API
that can be used by any such device in order to mirror a process address
space. The hmm code provides a single entry point and interfaces itself
with the core mm code of the Linux kernel, avoiding duplicate
implementations and shielding device driver code from core mm code.

Moreover, hmm also intends to provide support for migrating memory to
device private memory, allowing the device to work on its own fast local
memory. The hmm code would be responsible for intercepting CPU page
faults on migrated ranges and for migrating them back to system memory,
allowing the CPU to resume its access to the memory.

Another feature hmm intends to provide is support for atomic operations
on the device even if the bus linking the device and the CPU does not
have any such capability.

We expect graphics processing units and network interfaces to be among
the first prominent users of such an API.

Hardware requirements:

Because hmm is intended to be used by device drivers, there are minimum
feature requirements for the hardware MMU:
  - the hardware has its own page table per process (it can be shared
    between different devices)
  - the hardware MMU supports page faults and suspends execution until
    the page fault is serviced by the hmm code. The page fault must also
    trigger some form of interrupt so that the hmm code can be called by
    the device driver.
  - the hardware must support at least read-only mappings (otherwise it
    cannot access read-only ranges of the process address space).

For better memory management it is highly recommended that the device
also support the following features:
  - the hardware MMU sets the access bit in its page table on memory
    accesses (like the CPU).
  - the hardware page table can be updated from the CPU or through a
    fast path.
  - the hardware provides advanced statistics on which ranges of memory
    it accesses the most.
  - the hardware differentiates atomic memory accesses from regular
    accesses, allowing atomic operations to be supported even on
    platforms whose bus link to the device has no atomic support.

Implementation:

The hmm layer provides a simple API to the device driver. Each device
driver has to register an hmm device that holds pointers to all the
callbacks the hmm code will make to synchronize the device page table
with the CPU page table of a given process.

For each process it wants to mirror, the device driver must register a
mirror hmm structure that holds all the information specific to the
process being mirrored. Each hmm mirror uniquely links an hmm device
with a process address space (the mm struct).

This design allows several different device drivers to mirror the same
process concurrently. The hmm layer dispatches the modifications
happening to the process address space to each device driver as
appropriate (an illustrative registration sketch follows the patch
below).

The hmm layer relies on the mmu notifier API to monitor changes to the
process address space. Because updates to the device page table can
have unbounded completion time, the hmm layer needs the capability to
sleep during mmu notifier callbacks.

This patch only implements the core of the hmm layer and does not
support features such as migration to device memory.

Signed-off-by: Jérôme Glisse
Signed-off-by: Sherry Cheung
Signed-off-by: Subhash Gutti
Signed-off-by: Mark Hairgrove
Signed-off-by: John Hubbard
Signed-off-by: Jatin Kumar
---
 include/linux/hmm.h      |  470 ++++++++++++++++++
 include/linux/mm_types.h |   14 +
 kernel/fork.c            |    6 +
 mm/Kconfig               |   12 +
 mm/Makefile              |    1 +
 mm/hmm.c                 | 1194 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 1697 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..e9c7722
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,470 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Authors: Jérôme Glisse
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell it provides
+ * an API to mirror a process address space on a device which has its own
+ * mmu and its own page table for the process. It supports everything
+ * except special/mixed vma.
+ * + * To use this the hardware must have : + * - mmu with pagetable + * - pagetable must support read only (supporting dirtyness accounting= is + * preferable but is not mandatory). + * - support pagefault ie hardware thread should stop on fault and res= ume + * once hmm has provided valid memory to use. + * - some way to report fault. + * + * The hmm code handle all the interfacing with the core kernel mm code = and + * provide a simple API. It does support migrating system memory to devi= ce + * memory and handle migration back to system memory on cpu page fault. + * + * Migrated memory is considered as swaped from cpu and core mm code poi= nt of + * view. + */ +#ifndef _HMM_H +#define _HMM_H + +#ifdef CONFIG_HMM + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +struct hmm_device; +struct hmm_device_ops; +struct hmm_migrate; +struct hmm_mirror; +struct hmm_fault; +struct hmm_event; +struct hmm; + +/* The hmm provide page informations to the device using hmm pfn value. = Below + * are the various flags that define the current state the pfn is in (va= lid, + * type of page, dirty page, page is locked or not, ...). + * + * HMM_PFN_VALID_PAGE this means the pfn correspond to valid page. + * HMM_PFN_VALID_ZERO this means the pfn is the special zero page. + * HMM_PFN_DIRTY set when the page is dirty. + * HMM_PFN_WRITE is set if there is no need to call page_mkwrite + */ +#define HMM_PFN_SHIFT (PAGE_SHIFT) +#define HMM_PFN_VALID_PAGE (0UL) +#define HMM_PFN_VALID_ZERO (1UL) +#define HMM_PFN_DIRTY (2UL) +#define HMM_PFN_WRITE (3UL) + +static inline struct page *hmm_pfn_to_page(unsigned long pfn) +{ + /* Ok to test on bit after the other as it can not flip from one to + * the other. Both bit are constant for the lifetime of an rmem + * object. + */ + if (!test_bit(HMM_PFN_VALID_PAGE, &pfn) && + !test_bit(HMM_PFN_VALID_ZERO, &pfn)) { + return NULL; + } + return pfn_to_page(pfn >> HMM_PFN_SHIFT); +} + +static inline void hmm_pfn_set_dirty(unsigned long *pfn) +{ + set_bit(HMM_PFN_DIRTY, pfn); +} + + +/* hmm_fence - device driver fence to wait for device driver operations. + * + * In order to concurrently update several different devices mmu the hmm= rely + * on device driver fence to wait for operation hmm has schedule to comp= lete on + * the device. It is strongly recommanded to implement fences and have t= he hmm + * callback do as little as possible (just scheduling the update). Moreo= ver the + * hmm code will reschedule for i/o the current process if necessary onc= e it + * has scheduled all updates on all devices. + * + * Each fence is created as a result of either an update to range of mem= ory or + * for remote memory to/from local memory dma. + * + * Update to range of memory correspond to a specific event type. For in= stance + * range of memory is unmap for page reclamation, or range of memory is = unmap + * from process address as result of munmap syscall (HMM_RANGE_FINI), or= there + * a memory protection change on the range. There is one hmm_etype for e= ach of + * those event allowing the device driver to take appropriate action lik= e for + * instance freeing device page table on HMM_RANGE_FINI but keeping it i= f it is + * HMM_RANGE_UNMAP (which means that the range is unmap but the range is= still + * valid). 
+ */ +enum hmm_etype { + HMM_NONE =3D 0, + HMM_UNREGISTER, + HMM_DEVICE_FAULT, + HMM_MPROT_RONLY, + HMM_MPROT_RANDW, + HMM_MPROT_WONLY, + HMM_COW, + HMM_MUNMAP, + HMM_UNMAP, + HMM_MIGRATE_TO_LMEM, + HMM_MIGRATE_TO_RMEM, +}; + +struct hmm_fence { + struct list_head list; + struct hmm_mirror *mirror; +}; + + + + +/* hmm_device - Each device driver must register one and only one hmm_de= vice. + * + * The hmm_device is the link btw hmm and each device driver. + */ + +/* struct hmm_device_operations - hmm device operation callback + */ +struct hmm_device_ops { + /* device_destroy - free hmm_device (call when refcount drop to 0). + * + * @device: The device hmm specific structure. + */ + void (*device_destroy)(struct hmm_device *device); + + /* mirror_release() - device must stop using the address space. + * + * @mirror: The mirror that link process address space with the device. + * + * Called when as result of hmm_mirror_unregister or when mm is being + * destroy. + * + * It's illegal for the device to call any hmm helper function after + * this call back. The device driver must kill any pending device + * thread and wait for completion of all of them. + * + * Note that even after this callback returns the device driver might + * get call back from hmm. Callback will stop only once mirror_destroy + * is call. + */ + void (*mirror_release)(struct hmm_mirror *hmm_mirror); + + /* mirror_destroy - free hmm_mirror (call when refcount drop to 0). + * + * @mirror: The mirror that link process address space with the device. + */ + void (*mirror_destroy)(struct hmm_mirror *mirror); + + /* fence_wait() - to wait on device driver fence. + * + * @fence: The device driver fence struct. + * Returns: 0 on success,-EIO on error, -EAGAIN to wait again. + * + * Called when hmm want to wait for all operations associated with a + * fence to complete (including device cache flush if the event mandate + * it). + * + * Device driver must free fence and associated resources if it returns + * something else thant -EAGAIN. On -EAGAIN the fence must not be free + * as hmm will call back again. + * + * Return error if scheduled operation failed or if need to wait again. + * -EIO Some input/output error with the device. + * -EAGAIN The fence not yet signaled, hmm reschedule waiting thread. + * + * All other return value trigger warning and are transformed to -EIO. + */ + int (*fence_wait)(struct hmm_fence *fence); + + /* lmem_update() - update device mmu for a range of local memory. + * + * @mirror: The mirror that link process address space with the device. + * @faddr: First address in range (inclusive). + * @laddr: Last address in range (exclusive). + * @etype: The type of memory event (unmap, fini, read only, ...). + * @dirty: Device driver should call set_page_dirty_lock. + * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR. + * + * Called to update device mmu permission/usage for a range of local + * memory. The event type provide the nature of the update : + * - range is no longer valid (munmap). + * - range protection changes (mprotect, COW, ...). + * - range is unmapped (swap, reclaim, page migration, ...). + * - ... + * + * Any event that block further write to the memory must also trigger a + * device cache flush and everything has to be flush to local memory by + * the time the wait callback return (if this callback returned a fence + * otherwise everything must be flush by the time the callback return). 
+ * + * Device must properly call set_page_dirty on any page the device did + * write to since last call to update_lmem. This is only needed if the + * dirty parameter is true. + * + * The driver should return a fence pointer or NULL on success. It is + * advice to return fence and delay wait for the operation to complete + * to the wait callback. Returning a fence allow hmm to batch update to + * several devices and delay wait on those once they all have scheduled + * the update. + * + * Device driver must not fail lightly, any failure result in device + * process being kill. + * + * IMPORTANT IF DEVICE DRIVER GET HMM_MPROT_RANDW or HMM_MPROT_WONLY IT + * MUST NOT MAP SPECIAL ZERO PFN WITH WRITE PERMISSION. SPECIAL ZERO + * PFN IS SET THROUGH lmem_fault WITH THE HMM_PFN_VALID_ZERO BIT FLAG + * SET. + * + * Return fence or NULL on success, error value otherwise : + * -ENOMEM Not enough memory for performing the operation. + * -EIO Some input/output error with the device. + * + * All other return value trigger warning and are transformed to -EIO. + */ + struct hmm_fence *(*lmem_update)(struct hmm_mirror *mirror, + unsigned long faddr, + unsigned long laddr, + enum hmm_etype etype, + bool dirty); + + /* lmem_fault() - fault range of lmem on the device mmu. + * + * @mirror: The mirror that link process address space with the device. + * @faddr: First address in range (inclusive). + * @laddr: Last address in range (exclusive). + * @pfns: Array of pfn for the range (each of the pfn is valid). + * @fault: The fault structure provided by device driver. + * Returns: 0 on success, error value otherwise. + * + * Called to give the device driver each of the pfn backing a range of + * memory. It is only call as a result of a call to hmm_mirror_fault. + * + * Note that the pfns array content is only valid for the duration of + * the callback. Once the device driver callback return further memory + * activities might invalidate the value of the pfns array. The device + * driver will be inform of such changes through the update callback. + * + * Allowed return value are : + * -ENOMEM Not enough memory for performing the operation. + * -EIO Some input/output error with the device. + * + * Device driver must not fail lightly, any failure result in device + * process being kill. + * + * Return error if scheduled operation failed. Valid value : + * -ENOMEM Not enough memory for performing the operation. + * -EIO Some input/output error with the device. + * + * All other return value trigger warning and are transformed to -EIO. + */ + int (*lmem_fault)(struct hmm_mirror *mirror, + unsigned long faddr, + unsigned long laddr, + unsigned long *pfns, + struct hmm_fault *fault); +}; + +/* struct hmm_device - per device hmm structure + * + * @kref: Reference count. + * @mirrors: List of all active mirrors for the device. + * @mutex: Mutex protecting mirrors list. + * @ops: The hmm operations callback. + * @name: Device name (uniquely identify the device on the system)= . + * + * Each device that want to mirror an address space must register one of= this + * struct (only once). + */ +struct hmm_device { + struct kref kref; + struct list_head mirrors; + struct mutex mutex; + const struct hmm_device_ops *ops; + const char *name; +}; + +/* hmm_device_register() - register a device with hmm. + * + * @device: The hmm_device struct. + * @name: A unique name string for the device (use in error messages). + * Returns: 0 on success, -EINVAL otherwise. + * + * Call when device driver want to register itself with hmm. 
Device driv= er can + * only register once. It will return a reference on the device thus to = release + * a device the driver must unreference the device. + */ +int hmm_device_register(struct hmm_device *device, const char *name); + +struct hmm_device *hmm_device_ref(struct hmm_device *device); +struct hmm_device *hmm_device_unref(struct hmm_device *device); + + + + +/* hmm_mirror - device specific mirroring functions. + * + * Each device that mirror a process has a uniq hmm_mirror struct associ= ating + * the process address space with the device. A process can be mirrored = by + * several different devices at the same time. + */ + +/* struct hmm_mirror - per device and per mm hmm structure + * + * @kref: Reference count. + * @dlist: List of all hmm_mirror for same device. + * @mlist: List of all hmm_mirror for same mm. + * @device: The hmm_device struct this hmm_mirror is associated to. + * @hmm: The hmm struct this hmm_mirror is associated to. + * @dead: The hmm_mirror is dead and should no longer be use. + * + * Each device that want to mirror an address space must register one of= this + * struct for each of the address space it wants to mirror. Same device = can + * mirror several different address space. As well same address space ca= n be + * mirror by different devices. + */ +struct hmm_mirror { + struct kref kref; + struct list_head dlist; + struct list_head mlist; + struct hmm_device *device; + struct hmm *hmm; + bool dead; +}; + +/* hmm_mirror_register() - register a device mirror against an mm struct + * + * @mirror: The mirror that link process address space with the device. + * @device: The device struct to associate this mirror with. + * @mm: The mm struct of the process. + * Returns: 0 success, -ENOMEM, -EBUSY or -EINVAL if process already mir= rored. + * + * Call when device driver want to start mirroring a process address spa= ce. The + * hmm shim will register mmu_notifier and start monitoring process addr= ess + * space changes. Hence callback to device driver might happen even befo= re this + * function return. + * + * The mm pin must also be hold (either task is current or using get_tas= k_mm). + * + * Only one mirror per mm and hmm_device can be created, it will return = -EINVAL + * if the hmm_device already has an hmm_mirror for the the mm. + * + * If the mm or previous hmm is in transient state then this will return= -EBUSY + * and device driver must retry the call after unpinning the mm and chec= king + * again that the mm is valid. + * + * On success the mirror is returned with one reference for the caller, = thus to + * release mirror call hmm_mirror_unref. + */ +int hmm_mirror_register(struct hmm_mirror *mirror, + struct hmm_device *device, + struct mm_struct *mm); + +/* hmm_mirror_unregister() - unregister an hmm_mirror. + * + * @mirror: The mirror that link process address space with the device. + * + * Call when device driver want to stop mirroring a process address spac= e. + */ +void hmm_mirror_unregister(struct hmm_mirror *mirror); + +/* struct hmm_fault - device mirror fault informations + * + * @vma: The vma into which the fault range is (set by hmm). + * @faddr: First address of the range device want to fault (set by driv= er and + * updated by hmm to the actual first faulted address). + * @laddr: Last address of the range device want to fault (set by drive= r and + * updated by hmm to the actual last faulted address). 
+ * @pfns: Array to hold the pfn value of each page in the range (provi= ded by + * device driver, big enough to hold (laddr - faddr) >> PAGE_SH= IFT). + * @flags: Fault flags (set by driver). + * + * This structure is given by the device driver to hmm_mirror_fault. The= device + * driver can encapsulate the hmm_fault struct into its own fault struct= ure and + * use that to provide private device driver information to the lmem_fau= lt + * callback. + */ +struct hmm_fault { + struct vm_area_struct *vma; + unsigned long faddr; + unsigned long laddr; + unsigned long *pfns; + unsigned long flags; +}; + +#define HMM_FAULT_WRITE (1 << 0) + +/* hmm_mirror_fault() - call by the device driver on device memory fault= . + * + * @mirror: The mirror that link process address space with the devi= ce. + * @fault: The mirror fault struct holding fault range informations= . + * + * Call when device is trying to access an invalid address in the device= page + * table. The hmm shim will call lmem_fault with strong ordering in resp= ect to + * call to lmem_update (ie any information provided to lmem_fault is val= id + * until the device callback return). + * + * It will try to fault all pages in the range and give their pfn. If th= e vma + * covering the range needs to grow then it will. + * + * Also the fault will clamp the requested range to valid vma range (unl= ess + * the vma into which event->faddr falls to, can grow). + * + * All error must be handled by device driver and most likely result in = the + * process device tasks to be kill by the device driver. + * + * Returns: + * > 0 Number of pages faulted. + * -EINVAL if invalid argument. + * -ENOMEM if failing to allocate memory. + * -EACCES if trying to write to read only address (only for faddr). + * -EFAULT if trying to access an invalid address (only for faddr). + * -ENODEV if mirror is in process of being destroy. + */ +int hmm_mirror_fault(struct hmm_mirror *mirror, + struct hmm_fault *fault); + +struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror); +struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror); + + + + +/* Functions used by core mm code. Device driver should not use any of t= hem. */ +void __hmm_destroy(struct mm_struct *mm); +static inline void hmm_destroy(struct mm_struct *mm) +{ + if (mm->hmm) { + __hmm_destroy(mm); + } +} + +#else /* !CONFIG_HMM */ + +static inline void hmm_destroy(struct mm_struct *mm) +{ +} + +#endif /* !CONFIG_HMM */ + +#endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index de16272..8fa66cc 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -16,6 +16,10 @@ #include #include =20 +#ifdef CONFIG_HMM +struct hmm; +#endif + #ifndef AT_VECTOR_SIZE_ARCH #define AT_VECTOR_SIZE_ARCH 0 #endif @@ -425,6 +429,16 @@ struct mm_struct { #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_mm *mmu_notifier_mm; #endif +#ifdef CONFIG_HMM + /* + * hmm always register an mmu_notifier we rely on mmu notifier to keep + * refcount on mm struct as well as forbiding registering hmm on a + * dying mm + * + * This field is set with mmap_sem old in write mode. 
+ */ + struct hmm *hmm; +#endif #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS pgtable_t pmd_huge_pte; /* protected by page_table_lock */ #endif diff --git a/kernel/fork.c b/kernel/fork.c index 0d53eb0..56fce77 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -602,6 +603,8 @@ void __mmdrop(struct mm_struct *mm) mm_free_pgd(mm); destroy_context(mm); mmu_notifier_mm_destroy(mm); + /* hmm_destroy needs to be call after mmu_notifier_mm_destroy */ + hmm_destroy(mm); check_mm(mm); free_mm(mm); } @@ -820,6 +823,9 @@ static struct mm_struct *dup_mm(struct task_struct *t= sk) =20 memcpy(mm, oldmm, sizeof(*mm)); mm_init_cpumask(mm); +#ifdef CONFIG_HMM + mm->hmm =3D NULL; +#endif =20 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS mm->pmd_huge_pte =3D NULL; diff --git a/mm/Kconfig b/mm/Kconfig index 30cb6cb..7836f17 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -584,3 +584,15 @@ config PGTABLE_MAPPING =20 config GENERIC_EARLY_IOREMAP bool + +config HMM + bool "Enable heterogeneous memory management (HMM)" + depends on MMU + select MMU_NOTIFIER + default n + help + Heterogeneous memory management provide infrastructure for a device + to mirror a process address space into an hardware mmu or into any + things supporting pagefault like event. + + If unsure, say N to disable hmm. diff --git a/mm/Makefile b/mm/Makefile index b484452..d231646 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -63,3 +63,4 @@ obj-$(CONFIG_MEMORY_ISOLATION) +=3D page_isolation.o obj-$(CONFIG_ZBUD) +=3D zbud.o obj-$(CONFIG_ZSMALLOC) +=3D zsmalloc.o obj-$(CONFIG_GENERIC_EARLY_IOREMAP) +=3D early_ioremap.o +obj-$(CONFIG_HMM) +=3D hmm.o diff --git a/mm/hmm.c b/mm/hmm.c new file mode 100644 index 0000000..2b8986c --- /dev/null +++ b/mm/hmm.c @@ -0,0 +1,1194 @@ +/* + * Copyright 2013 Red Hat Inc. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 = USA + * + * Authors: J=C3=A9r=C3=B4me Glisse + */ +/* This is the core code for heterogeneous memory management (HMM). HMM = intend + * to provide helper for mirroring a process address space on a device a= s well + * as allowing migration of data between local memory and device memory. + * + * Refer to include/linux/hmm.h for further informations on general desi= gn. + */ +/* Locking : + * + * To synchronize with various mm event there is a simple serializatio= n of + * event touching overlapping range of address. Each mm event is assoc= iated + * with an hmm_event structure which store the address range of the ev= ent. + * + * When a new mm event call in hmm (most call comes through the mmu_no= tifier + * call backs) hmm allocate an hmm_event structure and wait for all pe= nding + * event that overlap with the new event. 
+ * + * To avoid deadlock with mmap_sem the rules it to always allocate new= hmm + * event after taking the mmap_sem lock. In case of mmu_notifier call = we do + * not take the mmap_sem lock as if it was needed it would have been t= aken + * by the caller of the mmu_notifier API. + * + * Hence hmm only need to make sure to allocate new hmm event after ta= king + * the mmap_sem. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "internal.h" + +#define HMM_MAX_RANGE_BITS (PAGE_SHIFT + 3UL) +#define HMM_MAX_RANGE_SIZE (PAGE_SIZE << HMM_MAX_RANGE_BITS) +#define MM_MAX_SWAP_PAGES (swp_offset(pte_to_swp_entry(swp_entry_to_pte(= swp_entry(0, ~0UL)))) + 1UL) +#define HMM_MAX_ADDR (((unsigned long)PTRS_PER_PGD) << ((unsigned long)= PGDIR_SHIFT)) + +#define HMM_MAX_EVENTS 16 + +/* global SRCU for all MMs */ +static struct srcu_struct srcu; + + + + +/* struct hmm_event - used to serialize change to overlapping range of a= ddress. + * + * @list: Current event list for the corresponding hmm. + * @faddr: First address (inclusive) for the range this event affec= t. + * @laddr: Last address (exclusive) for the range this event affect= . + * @fences: List of device fences associated with this event. + * @etype: Event type (munmap, migrate, truncate, ...). + * @backoff: Should this event backoff ie a new event render it obsol= ete. + */ +struct hmm_event { + struct list_head list; + unsigned long faddr; + unsigned long laddr; + struct list_head fences; + enum hmm_etype etype; + bool backoff; +}; + +/* struct hmm - per mm_struct hmm structure + * + * @mm: The mm struct. + * @kref: Reference counter + * @lock: Serialize the mirror list modifications. + * @mirrors: List of all mirror for this mm (one per device) + * @mmu_notifier: The mmu_notifier of this mm + * @wait_queue: Wait queue for synchronization btw cpu and device + * @events: Events. + * @nevents: Number of events currently happening. + * @dead: The mm is being destroy. + * + * For each process address space (mm_struct) there is one and only one = hmm + * struct. hmm functions will redispatch to each devices the change into= the + * process address space. + */ +struct hmm { + struct mm_struct *mm; + struct kref kref; + spinlock_t lock; + struct list_head mirrors; + struct list_head pending; + struct mmu_notifier mmu_notifier; + wait_queue_head_t wait_queue; + struct hmm_event events[HMM_MAX_EVENTS]; + int nevents; + bool dead; +}; + +static struct mmu_notifier_ops hmm_notifier_ops; + +static inline struct hmm *hmm_ref(struct hmm *hmm); +static inline struct hmm *hmm_unref(struct hmm *hmm); + +static int hmm_mirror_update(struct hmm_mirror *mirror, + struct vm_area_struct *vma, + unsigned long faddr, + unsigned long laddr, + struct hmm_event *event); +static void hmm_mirror_cleanup(struct hmm_mirror *mirror); + +static int hmm_device_fence_wait(struct hmm_device *device, + struct hmm_fence *fence); + + + + +/* hmm_event - use to synchronize various mm events with each others. + * + * During life time of process various mm events will happen, hmm serial= ize + * event that affect overlapping range of address. The hmm_event are use= for + * that purpose. 
+ */ + +static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_eve= nt *b) +{ + return !((a->laddr <=3D b->faddr) || (a->faddr >=3D b->laddr)); +} + +static inline unsigned long hmm_event_size(struct hmm_event *event) +{ + return (event->laddr - event->faddr); +} + + + + +/* hmm_fault_mm - used for reading cpu page table on device fault. + * + * This code deals with reading the cpu page table to find the pages tha= t are + * backing a range of address. It is use as an helper to the device page= fault + * code. + */ + +/* struct hmm_fault_mm - used for reading cpu page table on device fault= . + * + * @mm: The mm of the process the device fault is happening in. + * @vma: The vma in which the fault is happening. + * @faddr: The first address for the range the device want to fault. + * @laddr: The last address for the range the device want to fault. + * @pfns: Array of hmm pfns (contains the result of the fault). + * @write: Is this write fault. + */ +struct hmm_fault_mm { + struct mm_struct *mm; + struct vm_area_struct *vma; + unsigned long faddr; + unsigned long laddr; + unsigned long *pfns; + bool write; +}; + +static int hmm_fault_mm_fault_pmd(pmd_t *pmdp, + unsigned long faddr, + unsigned long laddr, + struct mm_walk *walk) +{ + struct hmm_fault_mm *fault_mm =3D walk->private; + unsigned long idx, *pfns; + pte_t *ptep; + + idx =3D (faddr - fault_mm->faddr) >> PAGE_SHIFT; + pfns =3D &fault_mm->pfns[idx]; + memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long)); + if (pmd_none(*pmdp)) { + return -ENOENT; + } + + if (pmd_trans_huge(*pmdp)) { + /* FIXME */ + return -EINVAL; + } + + if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) { + return -EINVAL; + } + + ptep =3D pte_offset_map(pmdp, faddr); + for (; faddr !=3D laddr; ++ptep, ++pfns, faddr +=3D PAGE_SIZE) { + pte_t pte =3D *ptep; + + if (pte_none(pte)) { + if (fault_mm->write) { + ptep++; + break; + } + *pfns =3D my_zero_pfn(faddr) << HMM_PFN_SHIFT; + set_bit(HMM_PFN_VALID_ZERO, pfns); + continue; + } + if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) { + /* Need to inc ptep so unmap unlock on right pmd. */ + ptep++; + break; + } + + *pfns =3D pte_pfn(pte) << HMM_PFN_SHIFT; + set_bit(HMM_PFN_VALID_PAGE, pfns); + if (pte_write(pte)) { + set_bit(HMM_PFN_WRITE, pfns); + } + /* Consider the page as hot as a device want to use it. */ + mark_page_accessed(pfn_to_page(pte_pfn(pte))); + fault_mm->laddr =3D faddr + PAGE_SIZE; + } + pte_unmap(ptep - 1); + + return (faddr =3D=3D laddr) ? 0 : -ENOENT; +} + +static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm) +{ + struct mm_walk walk =3D {0}; + unsigned long faddr, laddr; + int ret; + + faddr =3D fault_mm->faddr; + laddr =3D fault_mm->laddr; + fault_mm->laddr =3D faddr; + + walk.pmd_entry =3D hmm_fault_mm_fault_pmd; + walk.mm =3D fault_mm->mm; + walk.private =3D fault_mm; + + ret =3D walk_page_range(faddr, laddr, &walk); + return ret; +} + + + + +/* hmm - core hmm functions. + * + * Core hmm functions that deal with all the process mm activities and u= se + * event for synchronization. Those function are use mostly as result of= cpu + * mm event. 
+ */ + +static int hmm_init(struct hmm *hmm, struct mm_struct *mm) +{ + int i, ret; + + hmm->mm =3D mm; + kref_init(&hmm->kref); + INIT_LIST_HEAD(&hmm->mirrors); + INIT_LIST_HEAD(&hmm->pending); + spin_lock_init(&hmm->lock); + init_waitqueue_head(&hmm->wait_queue); + + for (i =3D 0; i < HMM_MAX_EVENTS; ++i) { + hmm->events[i].etype =3D HMM_NONE; + INIT_LIST_HEAD(&hmm->events[i].fences); + } + + /* register notifier */ + hmm->mmu_notifier.ops =3D &hmm_notifier_ops; + ret =3D __mmu_notifier_register(&hmm->mmu_notifier, mm); + return ret; +} + +static enum hmm_etype hmm_event_mmu(enum mmu_action action) +{ + switch (action) { + case MMU_MPROT_RONLY: + return HMM_MPROT_RONLY; + case MMU_MPROT_RANDW: + return HMM_MPROT_RANDW; + case MMU_MPROT_WONLY: + return HMM_MPROT_WONLY; + case MMU_COW: + return HMM_COW; + case MMU_MPROT_NONE: + case MMU_KSM: + case MMU_KSM_RONLY: + case MMU_UNMAP: + case MMU_VMSCAN: + case MMU_MUNLOCK: + case MMU_MIGRATE: + case MMU_FILE_WB: + case MMU_FAULT_WP: + case MMU_THP_SPLIT: + case MMU_THP_FAULT_WP: + return HMM_UNMAP; + case MMU_POISON: + case MMU_MREMAP: + case MMU_MUNMAP: + return HMM_MUNMAP; + case MMU_SOFT_DIRTY: + default: + return HMM_NONE; + } +} + +static void hmm_event_unqueue_locked(struct hmm *hmm, struct hmm_event *= event) +{ + list_del_init(&event->list); + event->etype =3D HMM_NONE; + hmm->nevents--; +} + +static void hmm_event_unqueue(struct hmm *hmm, struct hmm_event *event) +{ + spin_lock(&hmm->lock); + list_del_init(&event->list); + event->etype =3D HMM_NONE; + hmm->nevents--; + spin_unlock(&hmm->lock); +} + +static void hmm_destroy_kref(struct kref *kref) +{ + struct hmm *hmm; + struct mm_struct *mm; + + hmm =3D container_of(kref, struct hmm, kref); + mm =3D hmm->mm; + mm->hmm =3D NULL; + mmu_notifier_unregister(&hmm->mmu_notifier, mm); + + if (!list_empty(&hmm->mirrors)) { + BUG(); + printk(KERN_ERR "destroying an hmm with still active mirror\n" + "Leaking memory instead to avoid something worst.\n"); + return; + } + kfree(hmm); +} + +static inline struct hmm *hmm_ref(struct hmm *hmm) +{ + if (hmm) { + kref_get(&hmm->kref); + return hmm; + } + return NULL; +} + +static inline struct hmm *hmm_unref(struct hmm *hmm) +{ + if (hmm) { + kref_put(&hmm->kref, hmm_destroy_kref); + } + return NULL; +} + +static struct hmm_event *hmm_event_get(struct hmm *hmm, + unsigned long faddr, + unsigned long laddr, + enum hmm_etype etype) +{ + struct hmm_event *event, *wait =3D NULL; + enum hmm_etype wait_type; + unsigned id; + + do { + wait_event(hmm->wait_queue, hmm->nevents < HMM_MAX_EVENTS); + spin_lock(&hmm->lock); + for (id =3D 0; id < HMM_MAX_EVENTS; ++id) { + if (hmm->events[id].etype =3D=3D HMM_NONE) { + event =3D &hmm->events[id]; + goto out; + } + } + spin_unlock(&hmm->lock); + } while (1); + +out: + event->etype =3D etype; + event->faddr =3D faddr; + event->laddr =3D laddr; + event->backoff =3D false; + INIT_LIST_HEAD(&event->fences); + hmm->nevents++; + list_add_tail(&event->list, &hmm->pending); + +retry_wait: + wait =3D event; + list_for_each_entry_continue_reverse (wait, &hmm->pending, list) { + if (!hmm_event_overlap(event, wait)) { + continue; + } + switch (event->etype) { + case HMM_UNMAP: + case HMM_MUNMAP: + switch (wait->etype) { + case HMM_DEVICE_FAULT: + case HMM_MIGRATE_TO_RMEM: + wait->backoff =3D true; + /* fall through */ + default: + wait_type =3D wait->etype; + goto wait; + } + default: + wait_type =3D wait->etype; + goto wait; + } + } + spin_unlock(&hmm->lock); + + return event; + +wait: + spin_unlock(&hmm->lock); + 
wait_event(hmm->wait_queue, wait->etype !=3D wait_type); + spin_lock(&hmm->lock); + goto retry_wait; +} + +static void hmm_update_mirrors(struct hmm *hmm, + struct vm_area_struct *vma, + struct hmm_event *event) +{ + unsigned long faddr, laddr; + + for (faddr =3D event->faddr; faddr < event->laddr; faddr =3D laddr) { + struct hmm_mirror *mirror; + struct hmm_fence *fence =3D NULL, *tmp; + int ticket; + + laddr =3D event->laddr; + +retry_ranges: + ticket =3D srcu_read_lock(&srcu); + /* Because of retry we might already have scheduled some mirror + * skip those. + */ + mirror =3D list_first_entry(&hmm->mirrors, + struct hmm_mirror, + mlist); + mirror =3D fence ? fence->mirror : mirror; + list_for_each_entry_continue (mirror, &hmm->mirrors, mlist) { + int r; + + r =3D hmm_mirror_update(mirror,vma,faddr,laddr,event); + if (r) { + srcu_read_unlock(&srcu, ticket); + hmm_mirror_cleanup(mirror); + goto retry_ranges; + } + } + srcu_read_unlock(&srcu, ticket); + + list_for_each_entry_safe (fence, tmp, &event->fences, list) { + struct hmm_device *device; + int r; + + mirror =3D fence->mirror; + device =3D mirror->device; + + r =3D hmm_device_fence_wait(device, fence); + if (r) { + hmm_mirror_cleanup(mirror); + } + } + } +} + +static int hmm_fault_mm(struct hmm *hmm, + struct vm_area_struct *vma, + unsigned long faddr, + unsigned long laddr, + bool write) +{ + int r; + + if (laddr <=3D faddr) { + return -EINVAL; + } + + for (; faddr < laddr; faddr +=3D PAGE_SIZE) { + unsigned flags =3D 0; + + flags |=3D write ? FAULT_FLAG_WRITE : 0; + flags |=3D FAULT_FLAG_ALLOW_RETRY; + do { + r =3D handle_mm_fault(hmm->mm, vma, faddr, flags); + if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) { + if (r & VM_FAULT_OOM) { + return -ENOMEM; + } + /* Same error code for all other cases. */ + return -EFAULT; + } + flags &=3D ~FAULT_FLAG_ALLOW_RETRY; + } while (r & VM_FAULT_RETRY); + } + + return 0; +} + + + + +/* hmm_notifier - mmu_notifier hmm funcs tracking change to process mm. + * + * Callbacks for mmu notifier. We use use mmu notifier to track change m= ade to + * process address space. + * + * Note that none of this callback needs to take a reference, as we sure= that + * mm won't be destroy thus hmm won't be destroy either and it's fine if= some + * hmm_mirror/hmm_device are destroy during those callbacks because this= is + * serialize through either the hmm lock or the device lock. + */ + +static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_stru= ct *mm) +{ + struct hmm *hmm; + + if (!(hmm =3D hmm_ref(mm->hmm)) || hmm->dead) { + /* Already clean. */ + hmm_unref(hmm); + return; + } + + hmm->dead =3D true; + + /* + * hmm->lock allow synchronization with hmm_mirror_unregister() an + * hmm_mirror can be removed only once. + */ + spin_lock(&hmm->lock); + while (unlikely(!list_empty(&hmm->mirrors))) { + struct hmm_mirror *mirror; + struct hmm_device *device; + + mirror =3D list_first_entry(&hmm->mirrors, + struct hmm_mirror, + mlist); + device =3D mirror->device; + if (!mirror->dead) { + /* Update mirror as being dead and remove it from the + * mirror list before freeing up any of its resources. 
+ */ + mirror->dead =3D true; + list_del_init(&mirror->mlist); + spin_unlock(&hmm->lock); + + synchronize_srcu(&srcu); + + device->ops->mirror_release(mirror); + hmm_mirror_cleanup(mirror); + spin_lock(&hmm->lock); + } + } + spin_unlock(&hmm->lock); + hmm_unref(hmm); +} + +static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn, + struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long faddr, + unsigned long laddr, + enum mmu_action action) +{ + struct hmm_event *event; + enum hmm_etype etype; + struct hmm *hmm; + + if (!(hmm =3D hmm_ref(mm->hmm))) { + return; + } + + etype =3D hmm_event_mmu(action); + switch (etype) { + case HMM_NONE: + hmm_unref(hmm); + return; + default: + break; + } + + faddr =3D faddr & PAGE_MASK; + laddr =3D PAGE_ALIGN(laddr); + + event =3D hmm_event_get(hmm, faddr, laddr, etype); + hmm_update_mirrors(hmm, vma, event); + /* Do not drop hmm reference here but in the range_end instead. */ +} + +static void hmm_notifier_invalidate_range_end(struct mmu_notifier *mn, + struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long faddr, + unsigned long laddr, + enum mmu_action action) +{ + struct hmm_event *event =3D NULL; + enum hmm_etype etype; + struct hmm *hmm; + int i; + + if (!(hmm =3D mm->hmm)) { + return; + } + + etype =3D hmm_event_mmu(action); + switch (etype) { + case HMM_NONE: + return; + default: + break; + } + + faddr =3D faddr & PAGE_MASK; + laddr =3D PAGE_ALIGN(laddr); + + spin_lock(&hmm->lock); + for (i =3D 0; i < HMM_MAX_EVENTS; ++i, event =3D NULL) { + event =3D &hmm->events[i]; + if (event->etype =3D=3D etype && + event->faddr =3D=3D faddr && + event->laddr =3D=3D laddr && + !list_empty(&event->list)) { + hmm_event_unqueue_locked(hmm, event); + break; + } + } + spin_unlock(&hmm->lock); + + /* Drop reference from invalidate_range_start. */ + hmm_unref(hmm); +} + +static void hmm_notifier_invalidate_page(struct mmu_notifier *mn, + struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long faddr, + enum mmu_action action) +{ + unsigned long laddr; + struct hmm_event *event; + enum hmm_etype etype; + struct hmm *hmm; + + if (!(hmm =3D hmm_ref(mm->hmm))) { + return; + } + + etype =3D hmm_event_mmu(action); + switch (etype) { + case HMM_NONE: + return; + default: + break; + } + + faddr =3D faddr & PAGE_MASK; + laddr =3D faddr + PAGE_SIZE; + + event =3D hmm_event_get(hmm, faddr, laddr, etype); + hmm_update_mirrors(hmm, vma, event); + hmm_event_unqueue(hmm, event); + hmm_unref(hmm); +} + +static struct mmu_notifier_ops hmm_notifier_ops =3D { + .release =3D hmm_notifier_release, + /* .clear_flush_young FIXME we probably want to do something. */ + /* .test_young FIXME we probably want to do something. */ + /* WARNING .change_pte must always bracketed by range_start/end there + * was patches to remove that behavior we must make sure that those + * patches are not included as alternative solution to issue they are + * trying to solve can be use. + * + * While hmm can not use the change_pte callback as non sleeping lock + * are held during change_pte callback. + */ + .change_pte =3D NULL, + .invalidate_page =3D hmm_notifier_invalidate_page, + .invalidate_range_start =3D hmm_notifier_invalidate_range_start, + .invalidate_range_end =3D hmm_notifier_invalidate_range_end, +}; + + + + +/* hmm_mirror - per device mirroring functions. + * + * Each device that mirror a process has a uniq hmm_mirror struct. A pro= cess + * can be mirror by several devices at the same time. 
+ * + * Below are all the functions and there helpers use by device driver to= mirror + * the process address space. Those functions either deals with updating= the + * device page table (through hmm callback). Or provide helper functions= use by + * the device driver to fault in range of memory in the device page tabl= e. + */ + +static int hmm_mirror_update(struct hmm_mirror *mirror, + struct vm_area_struct *vma, + unsigned long faddr, + unsigned long laddr, + struct hmm_event *event) +{ + struct hmm_device *device =3D mirror->device; + struct hmm_fence *fence; + bool dirty =3D !!(vma->vm_file); + + fence =3D device->ops->lmem_update(mirror, faddr, laddr, + event->etype, dirty); + if (fence) { + if (IS_ERR(fence)) { + return PTR_ERR(fence); + } + fence->mirror =3D mirror; + list_add_tail(&fence->list, &event->fences); + } + return 0; +} + +static void hmm_mirror_cleanup(struct hmm_mirror *mirror) +{ + struct vm_area_struct *vma; + struct hmm_device *device =3D mirror->device; + struct hmm_event *event; + unsigned long faddr, laddr; + struct hmm *hmm =3D mirror->hmm; + + spin_lock(&hmm->lock); + if (mirror->dead) { + spin_unlock(&hmm->lock); + return; + } + mirror->dead =3D true; + list_del(&mirror->mlist); + spin_unlock(&hmm->lock); + synchronize_srcu(&srcu); + INIT_LIST_HEAD(&mirror->mlist); + + + event =3D hmm_event_get(hmm, 0UL, HMM_MAX_ADDR, HMM_UNREGISTER); + faddr =3D 0UL; + vma =3D find_vma(hmm->mm, faddr); + for (; vma && (faddr < HMM_MAX_ADDR); faddr =3D laddr) { + struct hmm_fence *fence, *next; + + faddr =3D max(faddr, vma->vm_start); + laddr =3D vma->vm_end; + + hmm_mirror_update(mirror, vma, faddr, laddr, event); + list_for_each_entry_safe (fence, next, &event->fences, list) { + hmm_device_fence_wait(device, fence); + } + + if (laddr >=3D vma->vm_end) { + vma =3D vma->vm_next; + } + } + hmm_event_unqueue(hmm, event); + + mutex_lock(&device->mutex); + list_del_init(&mirror->dlist); + mutex_unlock(&device->mutex); + + mirror->hmm =3D hmm_unref(hmm); + hmm_mirror_unref(mirror); +} + +static void hmm_mirror_destroy(struct kref *kref) +{ + struct hmm_mirror *mirror; + struct hmm_device *device; + + mirror =3D container_of(kref, struct hmm_mirror, kref); + device =3D mirror->device; + + BUG_ON(!list_empty(&mirror->mlist)); + BUG_ON(!list_empty(&mirror->dlist)); + + device->ops->mirror_destroy(mirror); + hmm_device_unref(device); +} + +struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror) +{ + if (mirror) { + kref_get(&mirror->kref); + return mirror; + } + return NULL; +} +EXPORT_SYMBOL(hmm_mirror_ref); + +struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror) +{ + if (mirror) { + kref_put(&mirror->kref, hmm_mirror_destroy); + } + return NULL; +} +EXPORT_SYMBOL(hmm_mirror_unref); + +int hmm_mirror_register(struct hmm_mirror *mirror, + struct hmm_device *device, + struct mm_struct *mm) +{ + struct hmm *hmm =3D NULL; + int ret =3D 0; + + /* Sanity checks. */ + BUG_ON(!mirror); + BUG_ON(!device); + BUG_ON(!mm); + + /* Take reference on device only on success. 
*/ + kref_init(&mirror->kref); + mirror->device =3D device; + mirror->dead =3D false; + INIT_LIST_HEAD(&mirror->mlist); + INIT_LIST_HEAD(&mirror->dlist); + + down_write(&mm->mmap_sem); + if (mm->hmm =3D=3D NULL) { + /* no hmm registered yet so register one */ + hmm =3D kzalloc(sizeof(*mm->hmm), GFP_KERNEL); + if (hmm =3D=3D NULL) { + ret =3D -ENOMEM; + goto out_cleanup; + } + + ret =3D hmm_init(hmm, mm); + if (ret) { + kfree(hmm); + hmm =3D NULL; + goto out_cleanup; + } + + /* now set hmm, make sure no mmu notifer callback might be call */ + ret =3D mm_take_all_locks(mm); + if (unlikely(ret)) { + goto out_cleanup; + } + mm->hmm =3D hmm; + mirror->hmm =3D hmm; + hmm =3D NULL; + } else { + struct hmm_mirror *tmp; + int id; + + id =3D srcu_read_lock(&srcu); + list_for_each_entry(tmp, &mm->hmm->mirrors, mlist) { + if (tmp->device =3D=3D mirror->device) { + /* A process can be mirrored only once by same + * device. + */ + srcu_read_unlock(&srcu, id); + ret =3D -EINVAL; + goto out_cleanup; + } + } + srcu_read_unlock(&srcu, id); + + ret =3D mm_take_all_locks(mm); + if (unlikely(ret)) { + goto out_cleanup; + } + mirror->hmm =3D hmm_ref(mm->hmm); + } + + /* + * A side note: hmm_notifier_release() can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar). + * + * We can't race against any other mmu notifier method either + * thanks to mm_take_all_locks(). + */ + spin_lock(&mm->hmm->lock); + list_add_rcu(&mirror->mlist, &mm->hmm->mirrors); + spin_unlock(&mm->hmm->lock); + mm_drop_all_locks(mm); + +out_cleanup: + if (hmm) { + mmu_notifier_unregister(&hmm->mmu_notifier, mm); + kfree(hmm); + } + up_write(&mm->mmap_sem); + + if (!ret) { + struct hmm_device *device =3D mirror->device; + + hmm_device_ref(device); + mutex_lock(&device->mutex); + list_add(&mirror->dlist, &device->mirrors); + mutex_unlock(&device->mutex); + } + return ret; +} +EXPORT_SYMBOL(hmm_mirror_register); + +void hmm_mirror_unregister(struct hmm_mirror *mirror) +{ + struct hmm *hmm; + + if (!mirror) { + return; + } + hmm =3D hmm_ref(mirror->hmm); + if (!hmm) { + return; + } + + down_read(&hmm->mm->mmap_sem); + hmm_mirror_cleanup(mirror); + up_read(&hmm->mm->mmap_sem); + hmm_unref(hmm); +} +EXPORT_SYMBOL(hmm_mirror_unregister); + +static int hmm_mirror_lmem_fault(struct hmm_mirror *mirror, + struct hmm_fault *fault, + unsigned long faddr, + unsigned long laddr, + unsigned long *pfns) +{ + struct hmm_device *device =3D mirror->device; + int ret; + + ret =3D device->ops->lmem_fault(mirror, faddr, laddr, pfns, fault); + return ret; +} + +/* see include/linux/hmm.h */ +int hmm_mirror_fault(struct hmm_mirror *mirror, + struct hmm_fault *fault) +{ + struct vm_area_struct *vma; + struct hmm_event *event; + unsigned long caddr, naddr, vm_flags; + struct hmm *hmm; + bool do_fault =3D false, write; + int ret =3D 0; + + if (!mirror || !fault || fault->faddr >=3D fault->laddr) { + return -EINVAL; + } + if (mirror->dead) { + return -ENODEV; + } + hmm =3D mirror->hmm; + + write =3D !!(fault->flags & HMM_FAULT_WRITE); + fault->faddr =3D fault->faddr & PAGE_MASK; + fault->laddr =3D PAGE_ALIGN(fault->laddr); + caddr =3D fault->faddr; + naddr =3D fault->laddr; + /* FIXME arbitrary value clamp fault to 4M at a time. 
*/ + if ((fault->laddr - fault->faddr) > (4UL << 20UL)) { + fault->laddr =3D fault->faddr + (4UL << 20UL); + } + hmm_mirror_ref(mirror); + +retry: + down_read(&hmm->mm->mmap_sem); + event =3D hmm_event_get(hmm, caddr, naddr, HMM_DEVICE_FAULT); + /* FIXME handle gate area ? and guard page */ + vma =3D find_extend_vma(hmm->mm, caddr); + if (!vma) { + if (caddr > fault->faddr) { + /* Fault succeed up to addr. */ + fault->laddr =3D caddr; + ret =3D 0; + goto out; + } + /* Allow device driver to learn about first valid address in + * the range it was trying to fault in so it can restart the + * fault at this address. + */ + vma =3D find_vma_intersection(hmm->mm,event->faddr,event->laddr); + if (vma) { + fault->laddr =3D vma->vm_start; + } + ret =3D -EFAULT; + goto out; + } + /* FIXME support HUGETLB */ + if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) { + ret =3D -EFAULT; + goto out; + } + vm_flags =3D write ? VM_WRITE : VM_READ; + if (!(vma->vm_flags & vm_flags)) { + ret =3D -EACCES; + goto out; + } + /* Adjust range to this vma only. */ + fault->laddr =3D naddr =3D event->laddr =3D min(event->laddr, vma->vm_e= nd); + fault->vma =3D vma; + + for (; caddr < event->laddr;) { + struct hmm_fault_mm fault_mm; + + fault_mm.mm =3D vma->vm_mm; + fault_mm.vma =3D vma; + fault_mm.faddr =3D caddr; + fault_mm.laddr =3D naddr; + fault_mm.pfns =3D fault->pfns; + fault_mm.write =3D write; + ret =3D hmm_fault_mm_fault(&fault_mm); + if (ret =3D=3D -ENOENT && fault_mm.laddr =3D=3D caddr) { + do_fault =3D true; + goto out; + } + if (ret && ret !=3D -ENOENT) { + goto out; + } + if (mirror->dead) { + ret =3D -ENODEV; + goto out; + } + if (event->backoff) { + ret =3D -EAGAIN; + goto out; + } + + ret =3D hmm_mirror_lmem_fault(mirror, fault, + fault_mm.faddr, + fault_mm.laddr, + fault_mm.pfns); + if (ret) { + goto out; + } + caddr =3D fault_mm.laddr; + naddr =3D event->laddr; + } + +out: + hmm_event_unqueue(hmm, event); + if (do_fault && !event->backoff && !mirror->dead) { + do_fault =3D false; + ret =3D hmm_fault_mm(hmm, vma, caddr, naddr, write); + if (!ret) { + ret =3D -ENOENT; + } + } + wake_up(&hmm->wait_queue); + up_read(&hmm->mm->mmap_sem); + if (ret =3D=3D -ENOENT) { + if (!mirror->dead) { + naddr =3D fault->laddr; + goto retry; + } + ret =3D -ENODEV; + } + hmm_mirror_unref(mirror); + return ret; +} +EXPORT_SYMBOL(hmm_mirror_fault); + + + + +/* hmm_device - Each device driver must register one and only one hmm_de= vice + * + * The hmm_device is the link btw hmm and each device driver. 
+ */
+
+static void hmm_device_destroy(struct kref *kref)
+{
+        struct hmm_device *device;
+
+        device = container_of(kref, struct hmm_device, kref);
+        BUG_ON(!list_empty(&device->mirrors));
+
+        device->ops->device_destroy(device);
+}
+
+struct hmm_device *hmm_device_ref(struct hmm_device *device)
+{
+        if (device) {
+                kref_get(&device->kref);
+                return device;
+        }
+        return NULL;
+}
+EXPORT_SYMBOL(hmm_device_ref);
+
+struct hmm_device *hmm_device_unref(struct hmm_device *device)
+{
+        if (device) {
+                kref_put(&device->kref, hmm_device_destroy);
+        }
+        return NULL;
+}
+EXPORT_SYMBOL(hmm_device_unref);
+
+/* see include/linux/hmm.h */
+int hmm_device_register(struct hmm_device *device, const char *name)
+{
+        /* sanity check */
+        BUG_ON(!device);
+        BUG_ON(!device->ops);
+        BUG_ON(!device->ops->device_destroy);
+        BUG_ON(!device->ops->mirror_release);
+        BUG_ON(!device->ops->mirror_destroy);
+        BUG_ON(!device->ops->fence_wait);
+        BUG_ON(!device->ops->lmem_update);
+        BUG_ON(!device->ops->lmem_fault);
+
+        kref_init(&device->kref);
+        device->name = name;
+        mutex_init(&device->mutex);
+        INIT_LIST_HEAD(&device->mirrors);
+
+        return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+                                 struct hmm_fence *fence)
+{
+        int ret;
+
+        if (fence == NULL) {
+                return 0;
+        }
+
+        list_del_init(&fence->list);
+        do {
+                io_schedule();
+                ret = device->ops->fence_wait(fence);
+        } while (ret == -EAGAIN);
+
+        return ret;
+}
+
+
+
+
+/* This is called after the last hmm_notifier_release() returned */
+void __hmm_destroy(struct mm_struct *mm)
+{
+        kref_put(&mm->hmm->kref, hmm_destroy_kref);
+}
+
+static int __init hmm_module_init(void)
+{
+        int ret;
+
+        ret = init_srcu_struct(&srcu);
+        if (ret) {
+                return ret;
+        }
+        return 0;
+}
+module_init(hmm_module_init);
+
+static void __exit hmm_module_exit(void)
+{
+        cleanup_srcu_struct(&srcu);
+}
+module_exit(hmm_module_exit);
--
1.9.0
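
Not part of the patch: the following is a minimal, illustrative sketch of
the driver-side registration flow described in the "Implementation"
section above. The foo_* structures and functions are hypothetical
placeholders and the callbacks are stubs; only struct hmm_device_ops,
hmm_device_register() and hmm_mirror_register() come from the header
added by this patch.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct foo_gpu {
        struct hmm_device hdevice;      /* embedded, one per driver */
};

struct foo_gpu_mirror {
        struct hmm_mirror mirror;       /* embedded, one per mirrored mm */
        struct foo_gpu *gpu;
};

static void foo_device_destroy(struct hmm_device *device)
{
        /* hmm_device refcount dropped to 0, free the embedding struct */
        kfree(container_of(device, struct foo_gpu, hdevice));
}

static void foo_mirror_release(struct hmm_mirror *mirror)
{
        /* stop and wait for all device threads using this address space */
}

static void foo_mirror_destroy(struct hmm_mirror *mirror)
{
        kfree(container_of(mirror, struct foo_gpu_mirror, mirror));
}

static int foo_fence_wait(struct hmm_fence *fence)
{
        /* poll the device; return -EAGAIN until the fence is signaled,
         * then free the fence and return 0 (or -EIO on device error).
         */
        return 0;
}

static struct hmm_fence *foo_lmem_update(struct hmm_mirror *mirror,
                                         unsigned long faddr,
                                         unsigned long laddr,
                                         enum hmm_etype etype,
                                         bool dirty)
{
        /* schedule invalidation/permission change of [faddr, laddr) in
         * the device page table; returning NULL means the update has
         * already completed, otherwise return a fence for fence_wait().
         */
        return NULL;
}

static int foo_lmem_fault(struct hmm_mirror *mirror,
                          unsigned long faddr,
                          unsigned long laddr,
                          unsigned long *pfns,
                          struct hmm_fault *fault)
{
        /* program one device pte per entry of the hmm pfns array */
        return 0;
}

static const struct hmm_device_ops foo_hmm_ops = {
        .device_destroy = foo_device_destroy,
        .mirror_release = foo_mirror_release,
        .mirror_destroy = foo_mirror_destroy,
        .fence_wait     = foo_fence_wait,
        .lmem_update    = foo_lmem_update,
        .lmem_fault     = foo_lmem_fault,
};

/* Called once at driver load time. */
static int foo_register_device(struct foo_gpu *gpu)
{
        gpu->hdevice.ops = &foo_hmm_ops;
        return hmm_device_register(&gpu->hdevice, "foo");
}

/* Called when a process using the device asks for mirroring. */
static int foo_mirror_current(struct foo_gpu *gpu)
{
        struct foo_gpu_mirror *fm = kzalloc(sizeof(*fm), GFP_KERNEL);
        int ret;

        if (!fm)
                return -ENOMEM;
        fm->gpu = gpu;
        /* registers an mmu_notifier; callbacks may fire before return */
        ret = hmm_mirror_register(&fm->mirror, &gpu->hdevice, current->mm);
        if (ret)
                kfree(fm);
        return ret;
}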