public inbox for linux-rdma@vger.kernel.org
* [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
       [not found] <1432236705-4209-1-git-send-email-j.glisse@gmail.com>
@ 2015-05-21 19:31 ` j.glisse
  2015-05-27  5:50   ` Aneesh Kumar K.V
  2015-06-08 19:40   ` Mark Hairgrove
       [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
  1 sibling, 2 replies; 22+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel,
	Liran Liss <liran>

From: Jérôme Glisse <jglisse@redhat.com>

This patch only introduces the core HMM functions for registering a new
mirror and stopping a mirror, as well as registering and unregistering
an HMM device.

The lifecycle of the HMM object is handled differently than that of
mmu_notifier because, unlike mmu_notifier, there can be concurrent
calls into HMM code from both the mm code and the device driver code.
Moreover, the lifetime of the HMM object can be uncorrelated from the
lifetime of the process that is being mirrored (the GPU might take a
longer time to clean up). A driver-side usage sketch follows the
changelog below.

Changed since v1:
  - Updated comment of hmm_device_register().

Changed since v2:
  - Expose struct hmm for easy access to mm struct.
  - Simplify hmm_mirror_register() arguments.
  - Removed the device name.
  - Refcount the mirror struct internally to HMM, allowing to get
    rid of the SRCU and making the device driver callback error
    handling simpler.
  - Safe to call hmm_mirror_unregister() several times.
  - Rework the mmu_notifier unregistration and release callback.
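
As a minimal driver-side usage sketch (the dummy_* names and the
kzalloc()/kfree() lifetime are illustrative assumptions, not part of this
patch; only struct hmm_device_ops, hmm_device_register() and
hmm_mirror_register() come from the code below), a driver would do
something like:

  #include <linux/hmm.h>
  #include <linux/slab.h>

  struct dummy_mirror {
          struct hmm_mirror       mirror;
          /* ... device private state ... */
  };

  /* ->release() is the last HMM call against this mirror, so it is
   * safe to free the embedding structure from here. */
  static void dummy_release(struct hmm_mirror *mirror)
  {
          struct dummy_mirror *dmirror;

          dmirror = container_of(mirror, struct dummy_mirror, mirror);
          /* Kill all device threads using this mirror, then: */
          kfree(dmirror);
  }

  static const struct hmm_device_ops dummy_ops = {
          .release = dummy_release,
  };

  static struct hmm_device dummy_device = {
          .ops = &dummy_ops,
  };

  /* Once at driver load; HMM initializes the mutex and mirror list. */
  static int dummy_init(void)
  {
          return hmm_device_register(&dummy_device);
  }

  /* From process context (current), e.g. an ioctl, once per mm. */
  static int dummy_mirror_current_mm(void)
  {
          struct dummy_mirror *dmirror;
          int ret;

          dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
          if (!dmirror)
                  return -ENOMEM;
          dmirror->mirror.device = &dummy_device;
          ret = hmm_mirror_register(&dmirror->mirror);
          if (ret)
                  /* ->release() is not called on this error path. */
                  kfree(dmirror);
          return ret;
  }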

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
cc: <linux-rdma@vger.kernel.org>
---
 MAINTAINERS              |   7 +
 include/linux/hmm.h      | 164 +++++++++++++++++++++
 include/linux/mm.h       |  11 ++
 include/linux/mm_types.h |  14 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  15 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 | 370 +++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 584 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 78ea7b6..2f2a2be 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4730,6 +4730,13 @@ F:	include/uapi/linux/if_hippi.h
 F:	net/802/hippi.c
 F:	drivers/net/hippi/
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm.c
+F:	include/linux/hmm.h
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	hostap@shmoo.com (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..175a757
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,164 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell it provides
+ * an API to mirror a process address space on a device which has its own
+ * mmu, using its own page table for the process. It supports everything
+ * except special vmas.
+ *
+ * Mandatory hardware features :
+ *   - An mmu with pagetable.
+ *   - Read only flag per cpu page.
+ *   - Page fault, ie hardware must stop and wait for the kernel to
+ *     service the fault.
+ *
+ * Optional hardware features :
+ *   - Dirty bit per cpu page.
+ *   - Access bit per cpu page.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It supports migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_mirror;
+struct hmm;
+
+
+/* hmm_device - Each device must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* struct hmm_device_ops - HMM device operation callbacks
+ */
+struct hmm_device_ops {
+	/* release() - mirror must stop using the address space.
+	 *
+	 * @mirror: The mirror linking the process address space with the
+	 *          device.
+	 *
+	 * When this is called, the device driver must kill all device
+	 * threads using this mirror. Also, this callback is the last thing
+	 * called by HMM; HMM will not access the mirror struct after this
+	 * call (ie no more dereferences of it, so it is safe for the device
+	 * driver to free it). It is called either from:
+	 *   - the mm dying (all processes using this mm exiting).
+	 *   - hmm_mirror_unregister() (if no other thread holds a reference).
+	 *   - the outcome of some device error reported by any of the device
+	 *     callbacks against that mirror.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+};
+
+
+/* struct hmm - per mm_struct HMM states.
+ *
+ * @mm: The mm struct this hmm is associated with.
+ * @mirrors: List of all mirrors for this mm (one per device).
+ * @vm_end: Last valid address for this mm (exclusive).
+ * @kref: Reference counter.
+ * @rwsem: Serialize the mirror list modifications.
+ * @mmu_notifier: The mmu_notifier of this mm.
+ * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes made to
+ * the process address space.
+ *
+ * Device driver must not access this structure other than for getting the
+ * mm pointer.
+ */
+struct hmm {
+	struct mm_struct	*mm;
+	struct hlist_head	mirrors;
+	unsigned long		vm_end;
+	struct kref		kref;
+	struct rw_semaphore	rwsem;
+	struct mmu_notifier	mmu_notifier;
+	struct rcu_head		rcu;
+};
+
+
+/* struct hmm_device - per device HMM structure
+ *
+ * @dev: Linux device structure pointer.
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @mutex: Mutex protecting mirrors list.
+ *
+ * Each device that wants to mirror an address space must register one such
+ * struct (only once per linux device).
+ */
+struct hmm_device {
+	struct device			*dev;
+	const struct hmm_device_ops	*ops;
+	struct list_head		mirrors;
+	struct mutex			mutex;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. The same process
+ * can be mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm HMM structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @kref: Reference counter (private to HMM do not use).
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ *
+ * Each device that wants to mirror an address space must register one such
+ * struct for each address space it wants to mirror. The same device can
+ * mirror several different address spaces, and the same address space can
+ * be mirrored by different devices.
+ */
+struct hmm_mirror {
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	struct kref		kref;
+	struct list_head	dlist;
+	struct hlist_node	mlist;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2923a51..cf642d9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2199,5 +2199,16 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+	mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7..4494f7f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -451,6 +455,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always registers an mmu_notifier; we rely on the mmu notifier
+	 * to keep a refcount on the mm struct as well as to forbid
+	 * registering hmm on a dying mm.
+	 *
+	 * This field is set with mmap_sem held in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 0e0ae9a..4083be7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -597,6 +598,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
+	hmm_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 52ffb86..189e48f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -653,3 +653,18 @@ config DEFERRED_STRUCT_PAGE_INIT
 	  when kswapd starts. This has a potential performance impact on
  processes running early in the lifetime of the system until kswapd
 	  finishes the initialisation.
+
+if STAGING
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	select GENERIC_PAGE_TABLE
+	default n
+	help
+	  Heterogeneous memory management provides infrastructure for a device
+	  to mirror a process address space into a hardware mmu or into
+	  anything supporting page-fault-like events.
+
+	  If unsure, say N to disable hmm.
+endif # STAGING
diff --git a/mm/Makefile b/mm/Makefile
index 98c4eae..90ca9c4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..e684dd0
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,370 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM
+ * intends to provide helpers for mirroring a process address space on a
+ * device, as well as allowing migration of data between system memory and
+ * device memory, referred to as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+
+#include "internal.h"
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
+static inline void hmm_mirror_unref(struct hmm_mirror **mirror);
+
+
+/* hmm - core HMM functions.
+ *
+ * Core HMM functions that deal with all the process mm activities.
+ */
+
+static int hmm_init(struct hmm *hmm)
+{
+	hmm->mm = current->mm;
+	hmm->vm_end = TASK_SIZE;
+	kref_init(&hmm->kref);
+	INIT_HLIST_HEAD(&hmm->mirrors);
+	init_rwsem(&hmm->rwsem);
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	return __mmu_notifier_register(&hmm->mmu_notifier, current->mm);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+	struct hmm_mirror *tmp;
+
+	down_write(&hmm->rwsem);
+	hlist_for_each_entry(tmp, &hmm->mirrors, mlist)
+		if (tmp->device == mirror->device) {
+			/* Same device can mirror only once. */
+			up_write(&hmm->rwsem);
+			return -EINVAL;
+		}
+	hlist_add_head(&mirror->mlist, &hmm->mirrors);
+	hmm_mirror_ref(mirror);
+	up_write(&hmm->rwsem);
+
+	return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (!hmm || !kref_get_unless_zero(&hmm->kref))
+		return NULL;
+	return hmm;
+}
+
+static void hmm_destroy_delayed(struct rcu_head *rcu)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(rcu, struct hmm, rcu);
+	kfree(hmm);
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	BUG_ON(!hlist_empty(&hmm->mirrors));
+
+	down_write(&hmm->mm->mmap_sem);
+	/* A new hmm might have been registered before reaching that point. */
+	if (hmm->mm->hmm == hmm)
+		hmm->mm->hmm = NULL;
+	up_write(&hmm->mm->mmap_sem);
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
+
+	mmu_notifier_call_srcu(&hmm->rcu, &hmm_destroy_delayed);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm)
+		kref_put(&hmm->kref, hmm_destroy);
+	return NULL;
+}
+
+
+/* hmm_notifier - HMM callbacks for mmu_notifier, tracking changes to a mm.
+ *
+ * HMM uses the mmu notifier to track changes made to the process address
+ * space.
+ */
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	hmm = hmm_ref(container_of(mn, struct hmm, mmu_notifier));
+	if (!hmm)
+		return;
+
+	down_write(&hmm->rwsem);
+	while (hmm->mirrors.first) {
+		struct hmm_mirror *mirror;
+
+		/*
+		 * Here we are holding the mirror reference from the mirror
+		 * list. As list removal is synchronized through the rwsem,
+		 * no other thread can assume it holds that reference.
+		 */
+		mirror = hlist_entry(hmm->mirrors.first,
+				     struct hmm_mirror,
+				     mlist);
+		hlist_del_init(&mirror->mlist);
+		up_write(&hmm->rwsem);
+
+		hmm_mirror_unref(&mirror);
+
+		down_write(&hmm->rwsem);
+	}
+	up_write(&hmm->rwsem);
+
+	hmm_unref(hmm);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A
+ * process can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with
+ * updating the device page table (through the hmm callback) or provide
+ * helper functions used by the device driver to fault in ranges of memory
+ * in the device page table.
+ */
+static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+	if (!mirror || !kref_get_unless_zero(&mirror->kref))
+		return NULL;
+	return mirror;
+}
+
+static void hmm_mirror_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+	struct hmm_mirror *mirror;
+	struct hmm *hmm;
+
+	mirror = container_of(kref, struct hmm_mirror, kref);
+	device = mirror->device;
+	hmm = mirror->hmm;
+
+	mutex_lock(&device->mutex);
+	list_del_init(&mirror->dlist);
+	device->ops->release(mirror);
+	mutex_unlock(&device->mutex);
+}
+
+static inline void hmm_mirror_unref(struct hmm_mirror **mirror)
+{
+	struct hmm_mirror *tmp = mirror ? *mirror : NULL;
+
+	if (tmp) {
+		*mirror = NULL;
+		kref_put(&tmp->kref, hmm_mirror_destroy);
+	}
+}
+
+/* hmm_mirror_register() - register a mirror of current process for a device.
+ *
+ * @mirror: The mirror struct being registered.
+ * Returns: 0 on success, -ENOMEM or -EINVAL on error.
+ *
+ * Called when a device driver wants to start mirroring a process address
+ * space. The HMM shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The task the device driver wants to mirror must be current!
+ *
+ * Only one mirror per mm and hmm_device can be created; this returns -EINVAL
+ * if the hmm_device already has an hmm_mirror for the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror)
+{
+	struct mm_struct *mm = current->mm;
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!mirror);
+	BUG_ON(!mirror->device);
+	BUG_ON(!mm);
+
+	/*
+	 * Initialize the mirror struct fields, the mlist init and del dance is
+	 * necessary to make the error path easier for the driver and for hmm.
+	 */
+	kref_init(&mirror->kref);
+	INIT_HLIST_NODE(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+	mutex_lock(&mirror->device->mutex);
+	list_add(&mirror->dlist, &mirror->device->mirrors);
+	mutex_unlock(&mirror->device->mutex);
+
+	down_write(&mm->mmap_sem);
+
+	hmm = hmm_ref(mm->hmm);
+	if (hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			up_write(&mm->mmap_sem);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		ret = hmm_init(hmm);
+		if (ret) {
+			up_write(&mm->mmap_sem);
+			kfree(hmm);
+			goto error;
+		}
+
+		mm->hmm = hmm;
+	}
+
+	mirror->hmm = hmm;
+	ret = hmm_add_mirror(hmm, mirror);
+	up_write(&mm->mmap_sem);
+	if (ret) {
+		mirror->hmm = NULL;
+		hmm_unref(hmm);
+		goto error;
+	}
+	return 0;
+
+error:
+	mutex_lock(&mirror->device->mutex);
+	list_del_init(&mirror->dlist);
+	mutex_unlock(&mirror->device->mutex);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_kill(struct hmm_mirror *mirror)
+{
+	down_write(&mirror->hmm->rwsem);
+	if (!hlist_unhashed(&mirror->mlist)) {
+		hlist_del_init(&mirror->mlist);
+		up_write(&mirror->hmm->rwsem);
+
+		hmm_mirror_unref(&mirror);
+	} else
+		up_write(&mirror->hmm->rwsem);
+}
+
+/* hmm_mirror_unregister() - unregister a mirror.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ *
+ * A driver can call this function when it wants to stop mirroring a process.
+ * This will trigger a call to the ->release() callback if it has not already
+ * happened.
+ *
+ * Note that caller must hold a reference on the mirror.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	if (mirror == NULL)
+		return;
+
+	hmm_mirror_kill(mirror);
+	hmm_mirror_unref(&mirror);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* hmm_device_register() - register a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success.
+ *
+ * Called when a device driver wants to register itself with HMM. A device
+ * driver must only register once.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->release);
+
+	mutex_init(&device->mutex);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EBUSY otherwise.
+ *
+ * Called when a device driver wants to unregister itself from HMM. This
+ * checks that there are no active mirrors and returns -EBUSY if any remain.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+	mutex_lock(&device->mutex);
+	if (!list_empty(&device->mirrors)) {
+		mutex_unlock(&device->mutex);
+		return -EBUSY;
+	}
+	mutex_unlock(&device->mutex);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
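
A matching teardown sketch, under the same illustrative dummy_* naming as
the registration sketch in the changelog above; the ordering constraint
(every mirror gone before the device) is exactly what the -EBUSY return
enforces:

  /* Per mirror, e.g. on file close. Per the changelog it is safe to
   * call hmm_mirror_unregister() several times; it triggers
   * ->release() (and thus the kfree()) if that has not already
   * happened. */
  static void dummy_stop_mirror(struct dummy_mirror *dmirror)
  {
          hmm_mirror_unregister(&dmirror->mirror);
  }

  /* At driver unload, once every mirror has been unregistered. */
  static void dummy_exit(void)
  {
          /* Returns -EBUSY while any mirror is still active. */
          WARN_ON(hmm_device_unregister(&dummy_device));
  }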
-- 
1.9.3


* [PATCH 30/36] IB/mlx5: add a new parameter to mlx5_ib_update_mtt() for ODP with HMM.
       [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range() jglisse
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel,
	Liran Liss <liranl@

From: Jérôme Glisse <jglisse@redhat.com>

When using HMM for ODP it will be useful to pass the current mirror
page table iterator to mlx5_ib_update_mtt(). Add a void pointer
parameter for this.
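
A hypothetical helper showing how the new argument is threaded through
(hmm_iter stands in for the future HMM mirror page table iterator, which
this patch does not add; existing ODP callers pass NULL, as in the hunks
below):

  static int dummy_update(struct mlx5_ib_mr *mr, u64 start_idx,
                          int npages, void *hmm_iter)
  {
          /* zap == 0: populate; the last argument reaches
           * __mlx5_ib_populate_pas() as its data cookie. */
          return mlx5_ib_update_mtt(mr, start_idx, npages, 0, hmm_iter);
  }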

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
 drivers/infiniband/hw/mlx5/mr.c      | 4 ++--
 drivers/infiniband/hw/mlx5/odp.c     | 8 +++++---
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index ec532f0..ec629f2 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -569,7 +569,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				  u64 virt_addr, int access_flags,
 				  struct ib_udata *udata);
 int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index,
-		       int npages, int zap);
+		       int npages, int zap, void *data);
 int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
 int mlx5_ib_destroy_mr(struct ib_mr *ibmr);
 struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 51a7775..759ed15 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -845,7 +845,7 @@ free_mr:
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
-		       int zap)
+		       int zap, void *data)
 {
 	struct mlx5_ib_dev *dev = mr->dev;
 	struct device *ddev = dev->ib_dev.dma_device;
@@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
 		if (!zap) {
 			__mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT,
 					       start_page_index, npages, pas,
-					       MLX5_IB_MTT_PRESENT, NULL);
+					       MLX5_IB_MTT_PRESENT, data);
 			/* Clear padding after the pages brought from the
 			 * umem. */
 			memset(pas + npages, 0, size - npages * sizeof(u64));
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5099db0..5171959 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -91,14 +91,15 @@ void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 
 			if (in_block && umr_offset == 0) {
 				mlx5_ib_update_mtt(mr, blk_start_idx,
-						   idx - blk_start_idx, 1);
+						   idx - blk_start_idx, 1,
+						   NULL);
 				in_block = 0;
 			}
 		}
 	}
 	if (in_block)
 		mlx5_ib_update_mtt(mr, blk_start_idx, idx - blk_start_idx + 1,
-				   1);
+				   1, NULL);
 
 	/*
 	 * We are now sure that the device will not access the
@@ -256,7 +257,8 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
 			 * this MR, since ib_umem_odp_map_dma_pages already
 			 * checks this.
 			 */
-			ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+			ret = mlx5_ib_update_mtt(mr, start_idx,
+						 npages, 0, NULL);
 		} else {
 			ret = -EAGAIN;
 		}
-- 
1.9.3


* [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range()
       [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
  2015-05-21 20:23   ` [PATCH 30/36] IB/mlx5: add a new parameter to mlx5_ib_update_mtt() for ODP with HMM jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP jglisse
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

From: Jérôme Glisse <jglisse@redhat.com>

The mlx5 driver will need this function for its driver-specific bit
of ODP (on demand paging) on HMM (Heterogeneous Memory Management).
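
A sketch of a driver-side walk over the umem interval tree once the symbol
is exported (the callback body is illustrative; umem_tree and umem_rwsem
are the existing ib_ucontext fields used by the ODP code):

  /* Illustrative callback: invalidate whatever device mappings cover
   * [start, end] of this umem. */
  static int dummy_invalidate_cb(struct ib_umem *item, u64 start,
                                 u64 end, void *cookie)
  {
          return 0;
  }

  static void dummy_walk_range(struct ib_ucontext *ctx, u64 addr,
                               u64 length)
  {
          down_read(&ctx->umem_rwsem);
          rbt_ib_umem_for_each_in_range(&ctx->umem_tree, addr,
                                        addr + length - 1,
                                        dummy_invalidate_cb, NULL);
          up_read(&ctx->umem_rwsem);
  }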

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/core/umem_rbtree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/core/umem_rbtree.c b/drivers/infiniband/core/umem_rbtree.c
index 727d788..f030ec0 100644
--- a/drivers/infiniband/core/umem_rbtree.c
+++ b/drivers/infiniband/core/umem_rbtree.c
@@ -92,3 +92,4 @@ int rbt_ib_umem_for_each_in_range(struct rb_root *root,
 
 	return ret_val;
 }
+EXPORT_SYMBOL(rbt_ib_umem_for_each_in_range);
-- 
1.9.3


* [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP.
       [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
  2015-05-21 20:23   ` [PATCH 30/36] IB/mlx5: add a new parameter to mlx5_ib_update_mtt() for ODP with HMM jglisse
  2015-05-21 20:23   ` [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range() jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

From: Jérôme Glisse <jglisse@redhat.com>

This is a preparatory patch for the HMM implementation of ODP (on demand
paging). It introduces a new config option and adds the proper build-time
conditional code sections. Enabling INFINIBAND_ON_DEMAND_PAGING_HMM
will result in a build error at this stage of the series.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/Kconfig                   |  10 ++
 drivers/infiniband/core/umem_odp.c           |   4 +
 drivers/infiniband/core/uverbs_cmd.c         |  17 +++-
 drivers/infiniband/hw/mlx5/main.c            |  10 +-
 drivers/infiniband/hw/mlx5/mem.c             |   8 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h         |   9 +-
 drivers/infiniband/hw/mlx5/mr.c              |  10 +-
 drivers/infiniband/hw/mlx5/odp.c             | 135 ++++++++++++++-------------
 drivers/infiniband/hw/mlx5/qp.c              |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/qp.c |   4 +-
 include/rdma/ib_umem_odp.h                   |  52 +++++++----
 include/rdma/ib_verbs.h                      |   4 +-
 12 files changed, 164 insertions(+), 101 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index b899531..764f524 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -49,6 +49,16 @@ config INFINIBAND_ON_DEMAND_PAGING
 	  memory regions without pinning their pages, fetching the
 	  pages on demand instead.
 
+config INFINIBAND_ON_DEMAND_PAGING_HMM
+	bool "InfiniBand on-demand paging support using HMM."
+	depends on HMM
+	depends on INFINIBAND_ON_DEMAND_PAGING
+	default n
+	---help---
+	  Use the HMM (heterogeneous memory management) kernel API for
+	  on-demand paging. There is no userspace-visible difference; this
+	  is just an alternative implementation of the feature.
+
 config INFINIBAND_ADDR_TRANS
 	bool
 	depends on INFINIBAND
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index d10dd88..e55e124 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,6 +41,9 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 static void ib_umem_notifier_start_account(struct ib_umem *item)
 {
 	mutex_lock(&item->odp_data->umem_mutex);
@@ -667,3 +670,4 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 virt,
 	mutex_unlock(&umem->odp_data->umem_mutex);
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a9f0489..ccd6bbe 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -290,8 +290,10 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	struct ib_udata                   udata;
 	struct ib_device                 *ibdev = file->device->ib_dev;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 	struct ib_device_attr		  dev_attr;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	struct ib_ucontext		 *ucontext;
 	struct file			 *filp;
 	int ret;
@@ -335,6 +337,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	ucontext->closing = 0;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 	ucontext->umem_tree = RB_ROOT;
 	init_rwsem(&ucontext->umem_rwsem);
 	ucontext->odp_mrs_count = 0;
@@ -345,8 +348,8 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		goto err_free;
 	if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
 		ucontext->invalidate_range = NULL;
-
-#endif
+#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	resp.num_comp_vectors = file->device->num_comp_vectors;
 
@@ -3335,6 +3338,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 		goto end;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	resp.odp_caps.general_caps = attr.odp_caps.general_caps;
 	resp.odp_caps.per_transport_caps.rc_odp_caps =
 		attr.odp_caps.per_transport_caps.rc_odp_caps;
@@ -3343,9 +3349,10 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	resp.odp_caps.per_transport_caps.ud_odp_caps =
 		attr.odp_caps.per_transport_caps.ud_odp_caps;
 	resp.odp_caps.reserved = 0;
-#else
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	resp.response_length += sizeof(resp.odp_caps);
 
 end:
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 57c9809..d553f90 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -156,10 +156,14 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 	props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	if (dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
 		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
 	props->odp_caps = dev->odp_caps;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 out:
 	kfree(in_mad);
@@ -486,8 +490,10 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
 	}
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 	context->ibucontext.invalidate_range = &mlx5_ib_invalidate_range;
-#endif
+#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	INIT_LIST_HEAD(&context->db_page_list);
 	mutex_init(&context->db_page_mutex);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index df56b7d..21084c7 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -132,7 +132,7 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
 
 	return mtt_entry;
 }
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 /*
  * Populate the given array with bus addresses from the umem.
@@ -163,6 +163,9 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 	struct scatterlist *sg;
 	int entry;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	const bool odp = umem->odp_data != NULL;
 
 	if (odp) {
@@ -176,7 +179,8 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 		}
 		return;
 	}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	i = 0;
 	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index ec629f2..a6d62be 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -231,7 +231,7 @@ struct mlx5_ib_qp {
 	 */
 	spinlock_t              disable_page_faults_lock;
 	struct mlx5_ib_pfault	pagefaults[MLX5_IB_PAGEFAULT_CONTEXTS];
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
 struct mlx5_ib_cq_buf {
@@ -440,7 +440,7 @@ struct mlx5_ib_dev {
 	 * being used by a page fault handler.
 	 */
 	struct srcu_struct      mr_srcu;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
 static inline struct mlx5_ib_cq *to_mibcq(struct mlx5_core_cq *mcq)
@@ -627,8 +627,13 @@ int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
 void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
 void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 			      unsigned long end);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 759ed15..23cd123 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -62,7 +62,7 @@ static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 	/* Wait until all page fault handlers using the mr complete. */
 	synchronize_srcu(&dev->mr_srcu);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	return err;
 }
@@ -1114,7 +1114,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		 */
 		smp_wmb();
 	}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	return &mr->ibmr;
 
@@ -1209,9 +1209,13 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
 		mr->live = 0;
 		/* Wait for all running page-fault handlers to finish. */
 		synchronize_srcu(&dev->mr_srcu);
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 		/* Destroy all page mappings */
 		mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
 					 ib_umem_end(umem));
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 		/*
 		 * We kill the umem before the MR for ODP,
 		 * so that there will not be any invalidations in
@@ -1223,7 +1227,7 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
 		/* Avoid double-freeing the umem. */
 		umem = NULL;
 	}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	clean_mr(mr);
 
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5171959..1de4d13 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -37,12 +37,30 @@
 
 #define MAX_PREFETCH_LEN (4*1024*1024U)
 
+struct workqueue_struct *mlx5_ib_page_fault_wq;
+
+static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
+						   u32 key)
+{
+	u32 base_key = mlx5_base_mkey(key);
+	struct mlx5_core_mr *mmr = __mlx5_mr_lookup(dev->mdev, base_key);
+	struct mlx5_ib_mr *mr = container_of(mmr, struct mlx5_ib_mr, mmr);
+
+	if (!mmr || mmr->key != key || !mr->live)
+		return NULL;
+
+	return container_of(mmr, struct mlx5_ib_mr, mmr);
+}
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
 /* Timeout in ms to wait for an active mmu notifier to complete when handling
  * a pagefault. */
 #define MMU_NOTIFIER_TIMEOUT 1000
 
-struct workqueue_struct *mlx5_ib_page_fault_wq;
-
 void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 			      unsigned long end)
 {
@@ -110,67 +128,6 @@ void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 	ib_umem_odp_unmap_dma_pages(umem, start, end);
 }
 
-#define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do {	\
-	if (be32_to_cpu(reg.field_name) & MLX5_ODP_SUPPORT_##bit_name)	\
-		ib_caps->field_name |= IB_ODP_SUPPORT_##bit_name;	\
-} while (0)
-
-int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
-{
-	int err;
-	struct mlx5_odp_caps hw_caps;
-	struct ib_odp_caps *caps = &dev->odp_caps;
-
-	memset(caps, 0, sizeof(*caps));
-
-	if (!(dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG))
-		return 0;
-
-	err = mlx5_query_odp_caps(dev->mdev, &hw_caps);
-	if (err)
-		goto out;
-
-	caps->general_caps = IB_ODP_SUPPORT;
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.ud_odp_caps,
-			       SEND);
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
-			       SEND);
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
-			       RECV);
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
-			       WRITE);
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
-			       READ);
-
-out:
-	return err;
-}
-
-static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
-						   u32 key)
-{
-	u32 base_key = mlx5_base_mkey(key);
-	struct mlx5_core_mr *mmr = __mlx5_mr_lookup(dev->mdev, base_key);
-	struct mlx5_ib_mr *mr = container_of(mmr, struct mlx5_ib_mr, mmr);
-
-	if (!mmr || mmr->key != key || !mr->live)
-		return NULL;
-
-	return container_of(mmr, struct mlx5_ib_mr, mmr);
-}
-
-static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
-				      struct mlx5_ib_pfault *pfault,
-				      int error) {
-	struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
-	int ret = mlx5_core_page_fault_resume(dev->mdev, qp->mqp.qpn,
-					      pfault->mpfault.flags,
-					      error);
-	if (ret)
-		pr_err("Failed to resolve the page fault on QP 0x%x\n",
-		       qp->mqp.qpn);
-}
-
 /*
  * Handle a single data segment in a page-fault WQE.
  *
@@ -298,6 +255,58 @@ srcu_unlock:
 	return ret ? ret : npages;
 }
 
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
+#define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do {	\
+	if (be32_to_cpu(reg.field_name) & MLX5_ODP_SUPPORT_##bit_name)	\
+		ib_caps->field_name |= IB_ODP_SUPPORT_##bit_name;	\
+} while (0)
+
+int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
+{
+	int err;
+	struct mlx5_odp_caps hw_caps;
+	struct ib_odp_caps *caps = &dev->odp_caps;
+
+	memset(caps, 0, sizeof(*caps));
+
+	if (!(dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG))
+		return 0;
+
+	err = mlx5_query_odp_caps(dev->mdev, &hw_caps);
+	if (err)
+		goto out;
+
+	caps->general_caps = IB_ODP_SUPPORT;
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.ud_odp_caps,
+			       SEND);
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+			       SEND);
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+			       RECV);
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+			       WRITE);
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+			       READ);
+
+out:
+	return err;
+}
+
+static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
+				      struct mlx5_ib_pfault *pfault,
+				      int error) {
+	struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
+	int ret = mlx5_core_page_fault_resume(dev->mdev, qp->mqp.qpn,
+					      pfault->mpfault.flags,
+					      error);
+	if (ret)
+		pr_err("Failed to resolve the page fault on QP 0x%x\n",
+		       qp->mqp.qpn);
+}
+
 /**
  * Parse a series of data segments for page fault handling.
  *
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index d35f62d..e5dec1e 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3046,7 +3046,7 @@ int mlx5_ib_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr
 	 * based upon this query's result.
 	 */
 	flush_workqueue(mlx5_ib_page_fault_wq);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	mutex_lock(&qp->mutex);
 	outb = kzalloc(sizeof(*outb), GFP_KERNEL);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index dc7dbf7..a437a14 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -175,7 +175,7 @@ void mlx5_eq_pagefault(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
 
 	mlx5_core_put_rsc(common);
 }
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 int mlx5_core_create_qp(struct mlx5_core_dev *dev,
 			struct mlx5_core_qp *qp,
@@ -440,4 +440,4 @@ int mlx5_core_page_fault_resume(struct mlx5_core_dev *dev, u32 qpn,
 	return err;
 }
 EXPORT_SYMBOL_GPL(mlx5_core_page_fault_resume);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 3da0b16..765aeb3 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -43,6 +43,9 @@ struct umem_odp_node {
 };
 
 struct ib_umem_odp {
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else
 	/*
 	 * An array of the pages included in the on-demand paging umem.
 	 * Indices of pages that are currently not mapped into the device will
@@ -62,8 +65,6 @@ struct ib_umem_odp {
 	 * also protects access to the mmu notifier counters.
 	 */
 	struct mutex		umem_mutex;
-	void			*private; /* for the HW driver to use. */
-
 	/* When false, use the notifier counter in the ucontext struct. */
 	bool mn_counters_active;
 	int notifiers_seq;
@@ -72,12 +73,13 @@ struct ib_umem_odp {
 	/* A linked list of umems that don't have private mmu notifier
 	 * counters yet. */
 	struct list_head no_private_counters;
+	struct completion	notifier_completion;
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+	void			*private; /* for the HW driver to use. */
 	struct ib_umem		*umem;
 
 	/* Tree tracking */
 	struct umem_odp_node	interval_tree;
-
-	struct completion	notifier_completion;
 	int			dying;
 };
 
@@ -87,6 +89,28 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem);
 
 void ib_umem_odp_release(struct ib_umem *umem);
 
+void rbt_ib_umem_insert(struct umem_odp_node *node, struct rb_root *root);
+void rbt_ib_umem_remove(struct umem_odp_node *node, struct rb_root *root);
+typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
+			      void *cookie);
+/*
+ * Call the callback on each ib_umem in the range. Returns the logical or of
+ * the return values of the functions called.
+ */
+int rbt_ib_umem_for_each_in_range(struct rb_root *root, u64 start, u64 end,
+				  umem_call_back cb, void *cookie);
+
+struct umem_odp_node *rbt_ib_umem_iter_first(struct rb_root *root,
+					     u64 start, u64 last);
+struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
+					    u64 start, u64 last);
+
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
 /*
  * The lower 2 bits of the DMA address signal the R/W permissions for
  * the entry. To upgrade the permissions, provide the appropriate
@@ -100,28 +124,13 @@ void ib_umem_odp_release(struct ib_umem *umem);
 
 #define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
 
+
 int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 start_offset, u64 bcnt,
 			      u64 access_mask, unsigned long current_seq);
 
 void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 start_offset,
 				 u64 bound);
 
-void rbt_ib_umem_insert(struct umem_odp_node *node, struct rb_root *root);
-void rbt_ib_umem_remove(struct umem_odp_node *node, struct rb_root *root);
-typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
-			      void *cookie);
-/*
- * Call the callback on each ib_umem in the range. Returns the logical or of
- * the return values of the functions called.
- */
-int rbt_ib_umem_for_each_in_range(struct rb_root *root, u64 start, u64 end,
-				  umem_call_back cb, void *cookie);
-
-struct umem_odp_node *rbt_ib_umem_iter_first(struct rb_root *root,
-					     u64 start, u64 last);
-struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
-					    u64 start, u64 last);
-
 static inline int ib_umem_mmu_notifier_retry(struct ib_umem *item,
 					     unsigned long mmu_seq)
 {
@@ -145,8 +154,11 @@ static inline int ib_umem_mmu_notifier_retry(struct ib_umem *item,
 	return 0;
 }
 
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
+
 static inline int ib_umem_odp_get(struct ib_ucontext *context,
 				  struct ib_umem *umem)
 {
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..7b00d30 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1157,6 +1157,7 @@ struct ib_ucontext {
 
 	struct pid             *tgid;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 	struct rb_root      umem_tree;
 	/*
 	 * Protects .umem_rbroot and tree, as well as odp_mrs_count and
@@ -1171,7 +1172,8 @@ struct ib_ucontext {
 	/* A list of umems that don't have private mmu notifier counters yet. */
 	struct list_head	no_private_counters;
 	int                     odp_mrs_count;
-#endif
+#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
 struct ib_uobject {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM.
       [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
                     ` (2 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-06-24 13:59     ` Haggai Eran
  2015-05-21 20:23   ` [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback jglisse
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

From: Jérôme Glisse <jglisse@redhat.com>

This adds new core InfiniBand structures and helpers to implement ODP
(on-demand paging) on top of HMM. We need to retain the tree of ib_umem
because some hardware associates a unique identifier with each umem (or
MR) and only allows its hardware page table to be updated through that
unique id.
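
As an illustration, here is a minimal sketch of how a driver consumes the
retained tree on an invalidation event. It is not part of the patch:
struct driver_mr, driver_mr_from_umem() and driver_update_hw_pt() are
hypothetical stand-ins for whatever per-MR handle the hardware expects;
only rbt_ib_umem_for_each_in_range() and the ib_mirror fields come from
this series.

    /* Hypothetical per-umem callback: translate the umem back to its MR
     * and issue a hardware page table update tagged with the unique id. */
    static int example_invalidate_one(struct ib_umem *umem, u64 start,
                                      u64 end, void *cookie)
    {
            struct driver_mr *mr = driver_mr_from_umem(umem);

            return driver_update_hw_pt(mr->hw_id, start, end);
    }

    static int example_invalidate_range(struct ib_mirror *ib_mirror,
                                        u64 start, u64 end)
    {
            int ret;

            down_read(&ib_mirror->umem_rwsem);
            ret = rbt_ib_umem_for_each_in_range(&ib_mirror->umem_tree,
                                                start, end,
                                                example_invalidate_one,
                                                NULL);
            up_read(&ib_mirror->umem_rwsem);
            return ret;
    }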

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/core/umem_odp.c    | 148 +++++++++++++++++++++++++++++++++-
 drivers/infiniband/core/uverbs_cmd.c  |   6 +-
 drivers/infiniband/core/uverbs_main.c |   6 ++
 include/rdma/ib_umem_odp.h            |  28 ++++++-
 include/rdma/ib_verbs.h               |  17 +++-
 5 files changed, 199 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e55e124..d5d57a8 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,9 +41,155 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 
+
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+
+
+static void ib_mirror_destroy(struct kref *kref)
+{
+	struct ib_mirror *ib_mirror;
+	struct ib_device *ib_device;
+
+	ib_mirror = container_of(kref, struct ib_mirror, kref);
+	hmm_mirror_unregister(&ib_mirror->base);
+
+	ib_device = ib_mirror->ib_device;
+	mutex_lock(&ib_device->hmm_mutex);
+	list_del_init(&ib_mirror->list);
+	mutex_unlock(&ib_device->hmm_mutex);
+	kfree(ib_mirror);
+}
+
+void ib_mirror_unref(struct ib_mirror *ib_mirror)
+{
+	if (ib_mirror == NULL)
+		return;
+
+	kref_put(&ib_mirror->kref, ib_mirror_destroy);
+}
+EXPORT_SYMBOL(ib_mirror_unref);
+
+static inline struct ib_mirror *ib_mirror_ref(struct ib_mirror *ib_mirror)
+{
+	if (!ib_mirror || !kref_get_unless_zero(&ib_mirror->kref))
+		return NULL;
+	return ib_mirror;
+}
+
+int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
+{
+	struct mm_struct *mm = get_task_mm(current);
+	struct ib_device *ib_device = context->device;
+	struct ib_mirror *ib_mirror;
+	struct pid *our_pid;
+	int ret;
+
+	if (!mm || !ib_device->hmm_ready)
+		return -EINVAL;
+
+	/* FIXME: can this really happen? */
+	if (unlikely(ib_umem_start(umem) == ib_umem_end(umem)))
+		return -EINVAL;
+
+	/* Prevent creating ODP MRs in child processes */
+	rcu_read_lock();
+	our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
+	rcu_read_unlock();
+	put_pid(our_pid);
+	if (context->tgid != our_pid) {
+		mmput(mm);
+		return -EINVAL;
+	}
+
+	umem->hugetlb = 0;
+	umem->odp_data = kmalloc(sizeof(*umem->odp_data), GFP_KERNEL);
+	if (umem->odp_data == NULL) {
+		mmput(mm);
+		return -ENOMEM;
+	}
+	umem->odp_data->private = NULL;
+	umem->odp_data->umem = umem;
+
+	mutex_lock(&ib_device->hmm_mutex);
+	/* Is there an existing mirror for this process mm? */
+	ib_mirror = ib_mirror_ref(context->ib_mirror);
+	if (!ib_mirror)
+		list_for_each_entry(ib_mirror, &ib_device->ib_mirrors, list) {
+			if (ib_mirror->base.hmm->mm != mm)
+				continue;
+			ib_mirror = ib_mirror_ref(ib_mirror);
+			break;
+		}
+
+	if (ib_mirror == NULL ||
+	    ib_mirror == list_first_entry(&ib_device->ib_mirrors,
+					  struct ib_mirror, list)) {
+		/* We need to create a new mirror. */
+		ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL);
+		if (ib_mirror == NULL) {
+			mutex_unlock(&ib_device->hmm_mutex);
+			mmput(mm);
+			return -ENOMEM;
+		}
+		kref_init(&ib_mirror->kref);
+		init_rwsem(&ib_mirror->hmm_mr_rwsem);
+		ib_mirror->umem_tree = RB_ROOT;
+		ib_mirror->ib_device = ib_device;
+
+		ib_mirror->base.device = &ib_device->hmm_dev;
+		ret = hmm_mirror_register(&ib_mirror->base);
+		if (ret) {
+			mutex_unlock(&ib_device->hmm_mutex);
+			kfree(ib_mirror);
+			mmput(mm);
+			return ret;
+		}
+
+		list_add(&ib_mirror->list, &ib_device->ib_mirrors);
+		context->ib_mirror = ib_mirror_ref(ib_mirror);
+	}
+	mutex_unlock(&ib_device->hmm_mutex);
+	umem->odp_data.ib_mirror = ib_mirror;
+
+	down_write(&ib_mirror->umem_rwsem);
+	rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	up_write(&ib_mirror->umem_rwsem);
+
+	mmput(mm);
+	return 0;
+}
+
+void ib_umem_odp_release(struct ib_umem *umem)
+{
+	struct ib_mirror *ib_mirror = umem->odp_data;
+
+	/*
+	 * Ensure that no more pages are mapped in the umem.
+	 *
+	 * It is the driver's responsibility to ensure, before calling us,
+	 * that the hardware will not attempt to access the MR any more.
+	 */
+
+	/* One optimization to release resources early here would be to call:
+	 *	hmm_mirror_range_discard(&ib_mirror->base,
+	 *			 ib_umem_start(umem),
+	 *			 ib_umem_end(umem));
+	 * But we can have overlapping umem so we would need to only discard
+	 * range covered by one and only one umem while holding the umem rwsem.
+	 */
+	down_write(&ib_mirror->umem_rwsem);
+	rbt_ib_umem_remove(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	up_write(&ib_mirror->umem_rwsem);
+
+	ib_mirror_unref(ib_mirror);
+	kfree(umem->odp_data);
+	kfree(umem);
+}
+
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
 static void ib_umem_notifier_start_account(struct ib_umem *item)
 {
 	mutex_lock(&item->odp_data->umem_mutex);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index ccd6bbe..3225ab5 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -337,7 +337,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	ucontext->closing = 0;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	ucontext->ib_mirror = NULL;
+#else  /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	ucontext->umem_tree = RB_ROOT;
 	init_rwsem(&ucontext->umem_rwsem);
 	ucontext->odp_mrs_count = 0;
@@ -348,7 +350,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		goto err_free;
 	if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
 		ucontext->invalidate_range = NULL;
-#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	resp.num_comp_vectors = file->device->num_comp_vectors;
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 88cce9b..3f069d7 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -45,6 +45,7 @@
 #include <linux/cdev.h>
 #include <linux/anon_inodes.h>
 #include <linux/slab.h>
+#include <rdma/ib_umem_odp.h>
 
 #include <asm/uaccess.h>
 
@@ -297,6 +298,11 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		kfree(uobj);
 	}
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	ib_mirror_unref(context->ib_mirror);
+	context->ib_mirror = NULL;
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
 	put_pid(context->tgid);
 
 	return context->device->dealloc_ucontext(context);
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 765aeb3..c7c2670 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -37,6 +37,32 @@
 #include <rdma/ib_verbs.h>
 #include <linux/interval_tree.h>
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+/* struct ib_mirror - per-process mirror structure for the infiniband driver.
+ *
+ * @ib_device: Infiniband device this mirror is associated with.
+ * @base: The hmm base mirror struct.
+ * @kref: Refcount for the structure.
+ * @list: For the list of ib_mirror of a given ib_device.
+ * @umem_tree: Red-black tree of ib_umem ordered by virtual address.
+ * @umem_rwsem: Semaphore protecting the red-black tree.
+ *
+ * Because the ib_ucontext struct is tied to a file descriptor, there can be
+ * several of them for the same process, which violates the HMM requirement.
+ * Hence we create only one ib_mirror per process; each ib_umem references it.
+ */
+struct ib_mirror {
+	struct ib_device	*ib_device;
+	struct hmm_mirror	base;
+	struct kref		kref;
+	struct list_head	list;
+	struct rb_root		umem_tree;
+	struct rw_semaphore	umem_rwsem;
+};
+
+void ib_mirror_unref(struct ib_mirror *ib_mirror);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
 struct umem_odp_node {
 	u64 __subtree_last;
 	struct rb_node rb;
@@ -44,7 +70,7 @@ struct umem_odp_node {
 
 struct ib_umem_odp {
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+	struct ib_mirror	*ib_mirror;
 #else
 	/*
 	 * An array of the pages included in the on-demand paging umem.
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7b00d30..83da1bd 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -49,6 +49,9 @@
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
 #include <uapi/linux/if_ether.h>
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#include <linux/hmm.h>
+#endif
 
 #include <linux/atomic.h>
 #include <linux/mmu_notifier.h>
@@ -1157,7 +1160,9 @@ struct ib_ucontext {
 
 	struct pid             *tgid;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	struct ib_mirror	*ib_mirror;
+#else  /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	struct rb_root      umem_tree;
 	/*
 	 * Protects .umem_rbroot and tree, as well as odp_mrs_count and
@@ -1172,7 +1177,7 @@ struct ib_ucontext {
 	/* A list of umems that don't have private mmu notifier counters yet. */
 	struct list_head	no_private_counters;
 	int                     odp_mrs_count;
-#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
@@ -1657,6 +1662,14 @@ struct ib_device {
 
 	struct ib_dma_mapping_ops   *dma_ops;
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	/* For ODP using HMM. */
+	struct hmm_device	     hmm_dev;
+	struct list_head	     ib_mirrors;
+	struct mutex		     hmm_mutex;
+	bool			     hmm_ready;
+#endif
+
 	struct module               *owner;
 	struct device                dev;
 	struct kobject               *ports_parent;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback.
       [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
                     ` (3 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM jglisse
  2015-05-21 20:23   ` [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM jglisse
  6 siblings, 0 replies; 22+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

From: Jérôme Glisse <jglisse@redhat.com>

This adds the core HMM callbacks for the mlx5 device driver and
initializes the HMM device for the mlx5 InfiniBand device driver.
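
For context, the glue added here boils down to the following condensed
sketch (the diff below has the full version; hmm_device_ops and
hmm_device_register() come from the HMM core earlier in this series,
while example_release() and example_update() stand in for the mlx5
callbacks):

    static const struct hmm_device_ops example_hmm_ops = {
            .release = &example_release, /* tear down all mirrored ranges */
            .update  = &example_update,  /* faults, invalidations, dirty */
    };

    static void example_init_odp_hmm(struct ib_device *ib_device,
                                     struct device *dev)
    {
            INIT_LIST_HEAD(&ib_device->ib_mirrors);
            ib_device->hmm_dev.dev = dev;
            ib_device->hmm_dev.ops = &example_hmm_ops;
            /* A registration failure simply leaves ODP disabled. */
            ib_device->hmm_ready = !hmm_device_register(&ib_device->hmm_dev);
            mutex_init(&ib_device->hmm_mutex);
    }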

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/core/umem_odp.c   |  12 ++-
 drivers/infiniband/hw/mlx5/main.c    |   5 +
 drivers/infiniband/hw/mlx5/mem.c     |  36 ++++++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  19 +++-
 drivers/infiniband/hw/mlx5/mr.c      |   9 +-
 drivers/infiniband/hw/mlx5/odp.c     | 177 ++++++++++++++++++++++++++++++++++-
 include/rdma/ib_umem_odp.h           |  20 +++-
 7 files changed, 268 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index d5d57a8..559542d 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -132,7 +132,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
 			return -ENOMEM;
 		}
 		kref_init(&ib_mirror->kref);
-		init_rwsem(&ib_mirror->hmm_mr_rwsem);
+		init_rwsem(&ib_mirror->umem_rwsem);
 		ib_mirror->umem_tree = RB_ROOT;
 		ib_mirror->ib_device = ib_device;
 
@@ -149,10 +149,11 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
 		context->ib_mirror = ib_mirror_ref(ib_mirror);
 	}
 	mutex_unlock(&ib_device->hmm_mutex);
-	umem->odp_data.ib_mirror = ib_mirror;
+	umem->odp_data->ib_mirror = ib_mirror;
 
 	down_write(&ib_mirror->umem_rwsem);
-	rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	rbt_ib_umem_insert(&umem->odp_data->interval_tree,
+			   &ib_mirror->umem_tree);
 	up_write(&ib_mirror->umem_rwsem);
 
 	mmput(mm);
@@ -161,7 +162,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
 
 void ib_umem_odp_release(struct ib_umem *umem)
 {
-	struct ib_mirror *ib_mirror = umem->odp_data;
+	struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
 
 	/*
 	 * Ensure that no more pages are mapped in the umem.
@@ -178,7 +179,8 @@ void ib_umem_odp_release(struct ib_umem *umem)
 	 * range covered by one and only one umem while holding the umem rwsem.
 	 */
 	down_write(&ib_mirror->umem_rwsem);
-	rbt_ib_umem_remove(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	rbt_ib_umem_remove(&umem->odp_data->interval_tree,
+			   &ib_mirror->umem_tree);
 	up_write(&ib_mirror->umem_rwsem);
 
 	ib_mirror_unref(ib_mirror);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index d553f90..eddabf0 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1316,6 +1316,9 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 	if (err)
 		goto err_rsrc;
 
+	/* If HMM initialization fails we just do not enable odp. */
+	mlx5_dev_init_odp_hmm(&dev->ib_dev, &mdev->pdev->dev);
+
 	err = ib_register_device(&dev->ib_dev, NULL);
 	if (err)
 		goto err_odp;
@@ -1340,6 +1343,7 @@ err_umrc:
 
 err_dev:
 	ib_unregister_device(&dev->ib_dev);
+	mlx5_dev_fini_odp_hmm(&dev->ib_dev);
 
 err_odp:
 	mlx5_ib_odp_remove_one(dev);
@@ -1359,6 +1363,7 @@ static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context)
 
 	ib_unregister_device(&dev->ib_dev);
 	destroy_umrc_res(dev);
+	mlx5_dev_fini_odp_hmm(&dev->ib_dev);
 	mlx5_ib_odp_remove_one(dev);
 	destroy_dev_resources(&dev->devr);
 	ib_dealloc_device(&dev->ib_dev);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 21084c7..f150825 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -164,7 +164,41 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 	int entry;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+	if (umem->odp_data) {
+		struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
+		struct hmm_mirror *mirror = &ib_mirror->base;
+		struct hmm_pt_iter *iter = data, local_iter;
+		unsigned long addr;
+
+		if (iter == NULL) {
+			iter = &local_iter;
+			hmm_pt_iter_init(iter);
+		}
+
+		for (i = 0, addr = ib_umem_start(umem) + (offset << PAGE_SHIFT);
+		     i < num_pages; ++i, addr += PAGE_SIZE) {
+			dma_addr_t *ptep, pte;
+
+			/* Get and lock pointer to mirror page table. */
+			ptep = hmm_pt_iter_update(iter, &mirror->pt, addr);
+			pte = ptep ? *ptep : 0;
+			/* HMM will not have any page tables set up if this
+			 * function is called before page faults have happened
+			 * on the MR. In that case, we don't have PAs yet, so
+			 * just set each one to zero and continue on. The hw
+			 * will trigger a page fault.
+			 */
+			if (hmm_pte_test_valid_dma(&pte))
+				pas[i] = cpu_to_be64(umem_dma_to_mtt(pte));
+			else
+				pas[i] = (__be64)0;
+		}
+
+		if (iter == &local_iter)
+			hmm_pt_iter_fini(iter, &mirror->pt);
+
+		return;
+	}
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	const bool odp = umem->odp_data != NULL;
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index a6d62be..f1bafd4 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -615,6 +615,7 @@ int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
 			    struct ib_mr_status *mr_status);
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+
 extern struct workqueue_struct *mlx5_ib_page_fault_wq;
 
 int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev);
@@ -629,13 +630,18 @@ void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
 void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+void mlx5_dev_init_odp_hmm(struct ib_device *ib_dev, struct device *dev);
+void mlx5_dev_fini_odp_hmm(struct ib_device *ib_dev);
+int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
+			    u64 end, void *cookie);
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 			      unsigned long end);
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
 static inline int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
 {
 	return 0;
@@ -671,4 +677,15 @@ static inline u8 convert_access(int acc)
 #define MLX5_MAX_UMR_SHIFT 16
 #define MLX5_MAX_UMR_PAGES (1 << MLX5_MAX_UMR_SHIFT)
 
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+static inline void mlx5_dev_init_odp_hmm(struct ib_device *ib_dev,
+					 struct device *dev)
+{
+}
+
+static inline void mlx5_dev_fini_odp_hmm(struct ib_device *ib_dev)
+{
+}
+#endif /* ! CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
 #endif /* MLX5_IB_H */
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 23cd123..7b2d84a 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1210,7 +1210,14 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
 		/* Wait for all running page-fault handlers to finish. */
 		synchronize_srcu(&dev->mr_srcu);
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+		if (mlx5_ib_umem_invalidate(umem, ib_umem_start(umem),
+					    ib_umem_end(umem), NULL))
+			/*
+			 * FIXME do something to kill all mr and umem
+			 * in use by this process.
+			 */
+			pr_err("killing all mr with odp due to "
+			       "mtt update failure\n");
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 		/* Destroy all page mappings */
 		mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 1de4d13..bd29155 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -52,8 +52,183 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
 	return container_of(mmr, struct mlx5_ib_mr, mmr);
 }
 
+
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+
+
+int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
+			    u64 end, void *cookie)
+{
+	const u64 umr_block_mask = (MLX5_UMR_MTT_ALIGNMENT / sizeof(u64)) - 1;
+	u64 idx = 0, blk_start_idx = 0;
+	struct hmm_pt_iter iter;
+	struct mlx5_ib_mr *mlx5_ib_mr;
+	struct hmm_mirror *mirror;
+	int in_block = 0;
+	u64 addr;
+	int ret = 0;
+
+	if (!umem || !umem->odp_data) {
+		pr_err("invalidation called on NULL umem or non-ODP umem\n");
+		return -EINVAL;
+	}
+
+	/* Is this ib_mr active and registered yet? */
+	if (umem->odp_data->private == NULL)
+		return 0;
+
+	mlx5_ib_mr = umem->odp_data->private;
+	if (!mlx5_ib_mr->ibmr.pd)
+		return 0;
+
+	start = max_t(u64, ib_umem_start(umem), start);
+	end = min_t(u64, ib_umem_end(umem), end);
+	hmm_pt_iter_init(&iter);
+	mirror = &umem->odp_data->ib_mirror->base;
+
+	/*
+	 * Iteration one - zap the HW's MTTs. HMM ensures that while we are
+	 * doing the invalidation, no page fault will attempt to overwrite the
+	 * same MTTs.  Concurrent invalidations might race us, but they will
+	 * write 0s as well, so no difference in the end result.
+	 */
+	for (addr = start; addr < end; addr += (u64)umem->page_size) {
+		dma_addr_t *ptep;
+
+		/* Needs to happen before ptep, as ptep might break the loop
+		 * and idx might be used outside the loop.
+		 */
+		idx = (addr - ib_umem_start(umem)) / PAGE_SIZE;
+
+		/* Get and lock pointer to mirror page table. */
+		ptep = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		if (!ptep) {
+			addr = hmm_pt_iter_next(&iter, &mirror->pt, addr, end);
+			continue;
+		}
+
+		/*
+		 * Strive to write the MTTs in chunks, but avoid overwriting
+		 * non-existing MTTs. The heuristic here can be improved to
+		 * estimate the cost of another UMR vs. the cost of a bigger
+		 * UMR.
+		 */
+		if ((*ptep) & (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+			if ((*ptep) & ODP_WRITE_ALLOWED_BIT)
+				hmm_pte_set_dirty(ptep);
+			/*
+			 * Because there cannot be concurrent overlapping
+			 * munmap, page migration or page write protection,
+			 * it is safe here to clear those bits.
+			 */
+			hmm_pte_clear_bit(ptep, ODP_READ_ALLOWED_SHIFT);
+			hmm_pte_clear_bit(ptep, ODP_WRITE_ALLOWED_SHIFT);
+			if (!in_block) {
+				blk_start_idx = idx;
+				in_block = 1;
+			}
+		} else {
+			u64 umr_offset = idx & umr_block_mask;
+
+			if (in_block && umr_offset == 0) {
+				ret = mlx5_ib_update_mtt(mlx5_ib_mr,
+							 blk_start_idx,
+							 idx - blk_start_idx, 1,
+							 &iter) || ret;
+				in_block = 0;
+			}
+		}
+	}
+	if (in_block)
+		ret = mlx5_ib_update_mtt(mlx5_ib_mr, blk_start_idx,
+					 idx - blk_start_idx + 1, 1,
+					 &iter) || ret;
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+	return ret;
+}
+
+static int mlx5_hmm_invalidate_range(struct hmm_mirror *mirror,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct ib_mirror *ib_mirror;
+	int ret;
+
+	ib_mirror = container_of(mirror, struct ib_mirror, base);
+
+	/* Go over all memory region and invalidate them. */
+	down_read(&ib_mirror->umem_rwsem);
+	ret = rbt_ib_umem_for_each_in_range(&ib_mirror->umem_tree, start, end,
+					    mlx5_ib_umem_invalidate, NULL);
+	up_read(&ib_mirror->umem_rwsem);
+	return ret;
+}
+
+static void mlx5_hmm_release(struct hmm_mirror *mirror)
+{
+	struct ib_mirror *ib_mirror;
+
+	ib_mirror = container_of(mirror, struct ib_mirror, base);
+
+	/* Go over all memory region and invalidate them. */
+	mlx5_hmm_invalidate_range(mirror, 0, ULLONG_MAX);
+}
+
+static int mlx5_hmm_update(struct hmm_mirror *mirror,
+			   const struct hmm_event *event)
+{
+	struct device *device = mirror->device->dev;
+	int ret = 0;
+
+	switch (event->etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		/* FIXME implement. */
+		break;
+	case HMM_ISDIRTY:
+		hmm_mirror_range_dirty(mirror, event->start, event->end);
+		break;
+	case HMM_NONE:
+	default:
+		dev_warn(device, "Warning: unhandled HMM event (%d)"
+			 "defaulting to invalidation\n", event->etype);
+		/* Fallthrough. */
+	/* For write protect and fork we could only invalidate writeable mr. */
+	case HMM_WRITE_PROTECT:
+	case HMM_MIGRATE:
+	case HMM_MUNMAP:
+	case HMM_FORK:
+		ret = mlx5_hmm_invalidate_range(mirror,
+						event->start,
+						event->end);
+		break;
+	}
+
+	return ret;
+}
+
+static const struct hmm_device_ops mlx5_hmm_ops = {
+	.release		= &mlx5_hmm_release,
+	.update			= &mlx5_hmm_update,
+};
+
+void mlx5_dev_init_odp_hmm(struct ib_device *ib_device, struct device *dev)
+{
+	INIT_LIST_HEAD(&ib_device->ib_mirrors);
+	ib_device->hmm_dev.dev = dev;
+	ib_device->hmm_dev.ops = &mlx5_hmm_ops;
+	ib_device->hmm_ready = !hmm_device_register(&ib_device->hmm_dev);
+	mutex_init(&ib_device->hmm_mutex);
+}
+
+void mlx5_dev_fini_odp_hmm(struct ib_device *ib_device)
+{
+	if (!ib_device->hmm_ready)
+		return;
+	hmm_device_unregister(&ib_device->hmm_dev);
+}
+
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
 
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index c7c2670..e982fd3 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -133,7 +133,25 @@ struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
 
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+
+
+/*
+ * HMM has a few bits reserved for hardware-specific use inside the mirror
+ * page table. For IB we record the per-page mapping protection there.
+ */
+#define ODP_READ_ALLOWED_SHIFT	(HMM_PTE_HW_SHIFT + 0)
+#define ODP_WRITE_ALLOWED_SHIFT	(HMM_PTE_HW_SHIFT + 1)
+#define ODP_READ_ALLOWED_BIT	(1 << ODP_READ_ALLOWED_SHIFT)
+#define ODP_WRITE_ALLOWED_BIT	(1 << ODP_WRITE_ALLOWED_SHIFT)
+
+/* Make sure we are not overwriting valid address bits on the target arch. */
+#if (HMM_PTE_HW_SHIFT + 2) > PAGE_SHIFT
+#error (HMM_PTE_HW_SHIFT + 2) > PAGE_SHIFT
+#endif
+
+#define ODP_DMA_ADDR_MASK HMM_PTE_DMA_MASK
+
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM.
       [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
                     ` (4 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM jglisse
  6 siblings, 0 replies; 22+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

From: Jérôme Glisse <jglisse@redhat.com>

This patch adds HMM-specific support for hardware page faulting of
user memory regions.
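
The heart of the fault path is computing the page-aligned window that HMM
must fault in, plus the first MTT index to update. A condensed sketch of
that computation follows (io_virt, bcnt and iova come from the page-fault
WQE and the MR; the hmm_event fields and hmm_mirror_fault() are from the
HMM core in this series):

    struct hmm_event ev;
    u64 start_idx;
    int ret;

    /* Index of the first MTT entry covered by the fault. */
    start_idx = (io_virt - (iova & PAGE_MASK)) >> PAGE_SHIFT;

    ev.start = io_virt & PAGE_MASK;        /* round down to a page */
    ev.end   = PAGE_ALIGN(io_virt + bcnt); /* round up to a page   */
    ev.etype = writable ? HMM_DEVICE_WFAULT : HMM_DEVICE_RFAULT;

    /* HMM faults the range and fills the mirror page table. */
    ret = hmm_mirror_fault(&ib_mirror->base, &ev);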

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/hw/mlx5/odp.c | 147 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index bd29155..093f5b8 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -56,6 +56,55 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 
 
+struct mlx5_hmm_pfault {
+	struct mlx5_ib_mr	*mlx5_ib_mr;
+	u64			start_idx;
+	dma_addr_t		access_mask;
+	unsigned		npages;
+	struct hmm_event	event;
+};
+
+static int mlx5_hmm_pfault(struct mlx5_ib_dev *mlx5_ib_dev,
+			   struct hmm_mirror *mirror,
+			   const struct hmm_event *event)
+{
+	struct mlx5_hmm_pfault *pfault;
+	struct hmm_pt_iter iter;
+	unsigned long addr, cnt;
+	int ret;
+
+	pfault = container_of(event, struct mlx5_hmm_pfault, event);
+	hmm_pt_iter_init(&iter);
+
+	for (addr = event->start, cnt = 0; addr < event->end;
+	     addr += PAGE_SIZE, ++cnt) {
+		dma_addr_t *ptep;
+
+		/* Get and lock pointer to mirror page table. */
+		ptep = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		/* This could be a BUG_ON() as it cannot happen. */
+		if (!ptep || !hmm_pte_test_valid_dma(ptep)) {
+			pr_warn("got empty mirror page table on pagefault.\n");
+			return -EINVAL;
+		}
+		if ((pfault->access_mask & ODP_WRITE_ALLOWED_BIT)) {
+			if (!hmm_pte_test_write(ptep)) {
+				pr_warn("got wrong protection permission on "
+					"pagefault.\n");
+				return -EINVAL;
+			}
+			hmm_pte_set_bit(ptep, ODP_WRITE_ALLOWED_SHIFT);
+		}
+		hmm_pte_set_bit(ptep, ODP_READ_ALLOWED_SHIFT);
+		pfault->npages++;
+	}
+	ret = mlx5_ib_update_mtt(pfault->mlx5_ib_mr,
+				 pfault->start_idx,
+				 cnt, 0, &iter);
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+	return ret;
+}
+
 int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
 			    u64 end, void *cookie)
 {
@@ -178,12 +227,19 @@ static int mlx5_hmm_update(struct hmm_mirror *mirror,
 			   const struct hmm_event *event)
 {
 	struct device *device = mirror->device->dev;
+	struct mlx5_ib_dev *mlx5_ib_dev;
+	struct ib_device *ib_device;
 	int ret = 0;
 
+	ib_device = container_of(mirror->device, struct ib_device, hmm_dev);
+	mlx5_ib_dev = to_mdev(ib_device);
+
 	switch (event->etype) {
 	case HMM_DEVICE_RFAULT:
 	case HMM_DEVICE_WFAULT:
-		/* FIXME implement. */
+		ret = mlx5_hmm_pfault(mlx5_ib_dev, mirror, event);
+		if (ret)
+			return ret;
 		break;
 	case HMM_ISDIRTY:
 		hmm_mirror_range_dirty(mirror, event->start, event->end);
@@ -228,6 +284,95 @@ void mlx5_dev_fini_odp_hmm(struct ib_device *ib_device)
 	hmm_device_unregister(&ib_device->hmm_dev);
 }
 
+/*
+ * Handle a single data segment in a page-fault WQE.
+ *
+ * Returns number of pages retrieved on success. The caller will continue to
+ * the next data segment.
+ * Can return the following error codes:
+ * -EAGAIN to designate a temporary error. The caller will abort handling the
+ *  page fault and resolve it.
+ * -EFAULT when there's an error mapping the requested pages. The caller will
+ *  abort the page fault handling and possibly move the QP to an error state.
+ * On other errors the QP should also be closed with an error.
+ */
+static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
+					 struct mlx5_ib_pfault *pfault,
+					 u32 key, u64 io_virt, size_t bcnt,
+					 u32 *bytes_mapped)
+{
+	struct mlx5_ib_dev *mlx5_ib_dev = to_mdev(qp->ibqp.pd->device);
+	struct ib_mirror *ib_mirror;
+	struct mlx5_hmm_pfault hmm_pfault;
+	int srcu_key;
+	int ret = 0;
+
+	srcu_key = srcu_read_lock(&mlx5_ib_dev->mr_srcu);
+	hmm_pfault.mlx5_ib_mr = mlx5_ib_odp_find_mr_lkey(mlx5_ib_dev, key);
+	/*
+	 * If we didn't find the MR, it means the MR was closed while we were
+	 * handling the ODP event. In this case we return -EFAULT so that the
+	 * QP will be closed.
+	 */
+	if (!hmm_pfault.mlx5_ib_mr || !hmm_pfault.mlx5_ib_mr->ibmr.pd) {
+		pr_err("Failed to find relevant mr for lkey=0x%06x, probably "
+		       "the MR was destroyed\n", key);
+		ret = -EFAULT;
+		goto srcu_unlock;
+	}
+	if (!hmm_pfault.mlx5_ib_mr->umem->odp_data) {
+		pr_debug("skipping non ODP MR (lkey=0x%06x) in page fault "
+		         "handler.\n", key);
+		if (bytes_mapped)
+			*bytes_mapped +=
+				(bcnt - pfault->mpfault.bytes_committed);
+		goto srcu_unlock;
+	}
+	if (hmm_pfault.mlx5_ib_mr->ibmr.pd != qp->ibqp.pd) {
+		pr_err("Page-fault with different PDs for QP and MR.\n");
+		ret = -EFAULT;
+		goto srcu_unlock;
+	}
+
+	ib_mirror = hmm_pfault.mlx5_ib_mr->umem->odp_data->ib_mirror;
+	if (ib_mirror->base.hmm == NULL) {
+		/* Somehow the mirror was killed out from under us. */
+		ret = -EFAULT;
+		goto srcu_unlock;
+	}
+
+	/*
+	 * Avoid branches - this code will perform correctly
+	 * in all iterations (in iteration 2 and above,
+	 * bytes_committed == 0).
+	 */
+	io_virt += pfault->mpfault.bytes_committed;
+	bcnt -= pfault->mpfault.bytes_committed;
+
+	hmm_pfault.npages = 0;
+	hmm_pfault.start_idx = (io_virt - (hmm_pfault.mlx5_ib_mr->mmr.iova &
+					   PAGE_MASK)) >> PAGE_SHIFT;
+	hmm_pfault.access_mask = ODP_READ_ALLOWED_BIT;
+	hmm_pfault.access_mask |= hmm_pfault.mlx5_ib_mr->umem->writable ?
+				  ODP_WRITE_ALLOWED_BIT : 0;
+	hmm_pfault.event.start = io_virt & PAGE_MASK;
+	hmm_pfault.event.end = PAGE_ALIGN(io_virt + bcnt);
+	hmm_pfault.event.etype = hmm_pfault.mlx5_ib_mr->umem->writable ?
+				 HMM_DEVICE_WFAULT : HMM_DEVICE_RFAULT;
+	ret = hmm_mirror_fault(&ib_mirror->base, &hmm_pfault.event);
+
+	if (!ret && hmm_pfault.npages && bytes_mapped) {
+		u32 new_mappings = hmm_pfault.npages * PAGE_SIZE -
+				   (io_virt - round_down(io_virt, PAGE_SIZE));
+		*bytes_mapped += min_t(u32, new_mappings, bcnt);
+	}
+
+srcu_unlock:
+	srcu_read_unlock(&mlx5_ib_dev->mr_srcu, srcu_key);
+	pfault->mpfault.bytes_committed = 0;
+	return ret ? ret : hmm_pfault.npages;
+}
+
 
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM.
       [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
                     ` (5 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM jglisse
@ 2015-05-21 20:23   ` jglisse
  6 siblings, 0 replies; 22+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

From: Jérôme Glisse <jglisse@redhat.com>

All the pieces are now in place for ODP (on-demand paging) to work using HMM.
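
Nothing changes for userspace: applications keep probing the ODP
capability bits as before. As a hedged sketch, assuming a libibverbs
recent enough to expose ibv_query_device_ex() and odp_caps, the check
looks roughly like this:

    #include <infiniband/verbs.h>

    static int odp_supported(struct ibv_context *ctx)
    {
            struct ibv_device_attr_ex attr;

            if (ibv_query_device_ex(ctx, NULL, &attr))
                    return 0;
            return !!(attr.odp_caps.general_caps & IBV_ODP_SUPPORT);
    }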

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/core/uverbs_cmd.c | 4 ----
 drivers/infiniband/hw/mlx5/main.c    | 6 +++++-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 3225ab5..16520bd 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3340,9 +3340,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 		goto end;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
-#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	resp.odp_caps.general_caps = attr.odp_caps.general_caps;
 	resp.odp_caps.per_transport_caps.rc_odp_caps =
 		attr.odp_caps.per_transport_caps.rc_odp_caps;
@@ -3351,7 +3348,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	resp.odp_caps.per_transport_caps.ud_odp_caps =
 		attr.odp_caps.per_transport_caps.ud_odp_caps;
 	resp.odp_caps.reserved = 0;
-#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index eddabf0..cff70a2 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -157,7 +157,11 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+	if ((dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG) &&
+	     ibdev->hmm_ready) {
+		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
+		props->odp_caps = dev->odp_caps;
+	}
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	if (dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
 		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-05-21 19:31 ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse
@ 2015-05-27  5:50   ` Aneesh Kumar K.V
  2015-05-27 14:38     ` Jerome Glisse
  2015-06-08 19:40   ` Mark Hairgrove
  1 sibling, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2015-05-27  5:50 UTC (permalink / raw)
  To: j.glisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

j.glisse@gmail.com writes:

> From: Jérôme Glisse <jglisse@redhat.com>
>
> This patch only introduce core HMM functions for registering a new
> mirror and stopping a mirror as well as HMM device registering and
> unregistering.
>
> The lifecycle of HMM object is handled differently then the one of
> mmu_notifier because unlike mmu_notifier there can be concurrent
> call from both mm code to HMM code and/or from device driver code
> to HMM code. Moreover lifetime of HMM can be uncorrelated from the
> lifetime of the process that is being mirror (GPU might take longer
> time to cleanup).
>

......

> +struct hmm_device_ops {
> +	/* release() - mirror must stop using the address space.
> +	 *
> +	 * @mirror: The mirror that link process address space with the device.
> +	 *
> +	 * When this is call, device driver must kill all device thread using

s/call/called, ?

> +	 * this mirror. Also, this callback is the last thing call by HMM and
> +	 * HMM will not access the mirror struct after this call (ie no more
> +	 * dereference of it so it is safe for the device driver to free it).
> +	 * It is call either from :
> +	 *   - mm dying (all process using this mm exiting).
> +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> +	 *   - outcome of some device error reported by any of the device
> +	 *     callback against that mirror.
> +	 */
> +	void (*release)(struct hmm_mirror *mirror);
> +};
> +
> +
> +/* struct hmm - per mm_struct HMM states.
> + *
> + * @mm: The mm struct this hmm is associated with.
> + * @mirrors: List of all mirror for this mm (one per device).
> + * @vm_end: Last valid address for this mm (exclusive).
> + * @kref: Reference counter.
> + * @rwsem: Serialize the mirror list modifications.
> + * @mmu_notifier: The mmu_notifier of this mm.
> + * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
> + *
> + * For each process address space (mm_struct) there is one and only one hmm
> + * struct. hmm functions will redispatch to each devices the change made to
> + * the process address space.
> + *
> + * Device driver must not access this structure other than for getting the
> + * mm pointer.
> + */

.....

>  #ifndef AT_VECTOR_SIZE_ARCH
>  #define AT_VECTOR_SIZE_ARCH 0
>  #endif
> @@ -451,6 +455,16 @@ struct mm_struct {
>  #ifdef CONFIG_MMU_NOTIFIER
>  	struct mmu_notifier_mm *mmu_notifier_mm;
>  #endif
> +#ifdef CONFIG_HMM
> +	/*
> +	 * hmm always register an mmu_notifier we rely on mmu notifier to keep
> +	 * refcount on mm struct as well as forbiding registering hmm on a
> +	 * dying mm
> +	 *
> +	 * This field is set with mmap_sem old in write mode.

s/old/held/ ?


> +	 */
> +	struct hmm *hmm;
> +#endif
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
>  	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
>  #endif
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0e0ae9a..4083be7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -27,6 +27,7 @@
>  #include <linux/binfmts.h>
>  #include <linux/mman.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/hmm.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/vmacache.h>
> @@ -597,6 +598,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  	mm_init_aio(mm);
>  	mm_init_owner(mm, p);
>  	mmu_notifier_mm_init(mm);
> +	hmm_mm_init(mm);
>  	clear_tlb_flush_pending(mm);
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
>  	mm->pmd_huge_pte = NULL;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 52ffb86..189e48f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -653,3 +653,18 @@ config DEFERRED_STRUCT_PAGE_INIT
>  	  when kswapd starts. This has a potential performance impact on
>  	  processes running early in the lifetime of the systemm until kswapd
>  	  finishes the initialisation.
> +
> +if STAGING
> +config HMM
> +	bool "Enable heterogeneous memory management (HMM)"
> +	depends on MMU
> +	select MMU_NOTIFIER
> +	select GENERIC_PAGE_TABLE

What is GENERIC_PAGE_TABLE ?

> +	default n
> +	help
> +	  Heterogeneous memory management provide infrastructure for a device
> +	  to mirror a process address space into an hardware mmu or into any
> +	  things supporting pagefault like event.
> +
> +	  If unsure, say N to disable hmm.

-aneesh


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-05-27  5:50   ` Aneesh Kumar K.V
@ 2015-05-27 14:38     ` Jerome Glisse
  0 siblings, 0 replies; 22+ messages in thread
From: Jerome Glisse @ 2015-05-27 14:38 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel

On Wed, May 27, 2015 at 11:20:05AM +0530, Aneesh Kumar K.V wrote:
> j.glisse@gmail.com writes:

Noted your grammar fixes.

> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 52ffb86..189e48f 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -653,3 +653,18 @@ config DEFERRED_STRUCT_PAGE_INIT
> >  	  when kswapd starts. This has a potential performance impact on
> >  	  processes running early in the lifetime of the systemm until kswapd
> >  	  finishes the initialisation.
> > +
> > +if STAGING
> > +config HMM
> > +	bool "Enable heterogeneous memory management (HMM)"
> > +	depends on MMU
> > +	select MMU_NOTIFIER
> > +	select GENERIC_PAGE_TABLE
> 
> What is GENERIC_PAGE_TABLE ?

Left over from when patch 0006 was a separate feature that was introduced
before this patch. I failed to remove that chunk. Just ignore it.
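
For reference, with that leftover select dropped, the entry would simply
read:

config HMM
	bool "Enable heterogeneous memory management (HMM)"
	depends on MMU
	select MMU_NOTIFIER
	default n
	help
	  Heterogeneous memory management provide infrastructure for a device
	  to mirror a process address space into an hardware mmu or into any
	  things supporting pagefault like event.

	  If unsure, say N to disable hmm.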

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-05-21 19:31 ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse
  2015-05-27  5:50   ` Aneesh Kumar K.V
@ 2015-06-08 19:40   ` Mark Hairgrove
       [not found]     ` <alpine.DEB.2.00.1506081222270.27796-ptWJzH4JGIzJt4XymMeBgkn48jw8i0AO@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Mark Hairgrove @ 2015-06-08 19:40 UTC (permalink / raw)
  To: j.glisse@gmail.com
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind




On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
>
> This patch only introduce core HMM functions for registering a new
> mirror and stopping a mirror as well as HMM device registering and
> unregistering.
>
> [...]
>
> +/* struct hmm_device_operations - HMM device operation callback
> + */
> +struct hmm_device_ops {
> +	/* release() - mirror must stop using the address space.
> +	 *
> +	 * @mirror: The mirror that link process address space with the device.
> +	 *
> +	 * When this is call, device driver must kill all device thread using
> +	 * this mirror. Also, this callback is the last thing call by HMM and
> +	 * HMM will not access the mirror struct after this call (ie no more
> +	 * dereference of it so it is safe for the device driver to free it).
> +	 * It is call either from :
> +	 *   - mm dying (all process using this mm exiting).
> +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> +	 *   - outcome of some device error reported by any of the device
> +	 *     callback against that mirror.
> +	 */
> +	void (*release)(struct hmm_mirror *mirror);
> +};

The comment that ->release is called when the mm dies doesn't match the
implementation. ->release is only called when the mirror is destroyed, and
that can only happen after the mirror has been unregistered. This may not
happen until after the mm dies.

Is the intent for the driver to get the callback when the mm goes down?
That seems beneficial so the driver can kill whatever's happening on the
device. Otherwise the device may continue operating in a dead address
space until the driver's file gets closed and it unregisters the mirror.


> +static void hmm_mirror_destroy(struct kref *kref)
> +{
> +	struct hmm_device *device;
> +	struct hmm_mirror *mirror;
> +	struct hmm *hmm;
> +
> +	mirror = container_of(kref, struct hmm_mirror, kref);
> +	device = mirror->device;
> +	hmm = mirror->hmm;
> +
> +	mutex_lock(&device->mutex);
> +	list_del_init(&mirror->dlist);
> +	device->ops->release(mirror);
> +	mutex_unlock(&device->mutex);
> +}

The hmm variable is unused. It also probably isn't safe to access at this
point.


> +static void hmm_mirror_kill(struct hmm_mirror *mirror)
> +{
> +	down_write(&mirror->hmm->rwsem);
> +	if (!hlist_unhashed(&mirror->mlist)) {
> +		hlist_del_init(&mirror->mlist);
> +		up_write(&mirror->hmm->rwsem);
> +
> +		hmm_mirror_unref(&mirror);
> +	} else
> +		up_write(&mirror->hmm->rwsem);
> +}

Shouldn't this call hmm_unref? hmm_mirror_register calls hmm_ref but
there's no corresponding hmm_unref when the mirror goes away. As a result
the hmm struct gets leaked and thus so does the entire mm since
mmu_notifier_unregister is never called.

It might also be a good idea to set mirror->hmm = NULL here to prevent
accidental use in say hmm_mirror_destroy.


> +/* hmm_device_unregister() - unregister a device with HMM.
> + *
> + * @device: The hmm_device struct.
> + * Returns: 0 on success or -EBUSY otherwise.
> + *
> + * Call when device driver want to unregister itself with HMM. This will check
> + * that there is no any active mirror and returns -EBUSY if so.
> + */
> +int hmm_device_unregister(struct hmm_device *device)
> +{
> +	mutex_lock(&device->mutex);
> +	if (!list_empty(&device->mirrors)) {
> +		mutex_unlock(&device->mutex);
> +		return -EBUSY;
> +	}
> +	mutex_unlock(&device->mutex);
> +	return 0;
> +}

I assume that the intention is for the caller to spin on
hmm_device_unregister until -EBUSY is no longer returned?

If so, I think there's a race here in the case of mm teardown happening
concurrently with hmm_mirror_unregister. This can happen if the parent
process was forked and exits while the child closes the file, or if the
file is passed to another process and closed last there while the original
process exits.

The upshot is that the hmm_device may still be referenced by another
thread even after hmm_device_unregister returns 0.

The below sequence shows how this might happen. Coming into this, the
mirror's ref count is 2:

Thread A (file close)               Thread B (process exit)
----------------------              ----------------------
                                    hmm_notifier_release
                                      down_write(&hmm->rwsem);
hmm_mirror_unregister
  hmm_mirror_kill
    down_write(&hmm->rwsem);
    // Blocked on thread B
                                      hlist_del_init(&mirror->mlist);
                                      up_write(&hmm->rwsem);

                                      // Thread A unblocked
                                      // Thread B is preempted
    // hlist_unhashed returns 1
    up_write(&hmm->rwsem);

  // Mirror ref goes 2 -> 1
  hmm_mirror_unref(&mirror);

  // hmm_mirror_unregister returns

At this point hmm_mirror_unregister has returned to the caller but the
mirror still is in use by thread B. Since all mirrors have been
unregistered, the driver in thread A is now free to call
hmm_device_unregister.

                                      // Thread B is scheduled

                                      // Mirror ref goes 1 -> 0
                                      hmm_mirror_unref(&mirror);
                                        hmm_mirror_destroy(&mirror)
                                          mutex_lock(&device->mutex);
                                          list_del_init(&mirror->dlist);
                                          device->ops->release(mirror);
                                          mutex_unlock(&device->mutex);

hmm_device_unregister
  mutex_lock(&device->mutex);
  // Device list empty
  mutex_unlock(&device->mutex);
  return 0;
// Caller frees device

Do you agree that this sequence can happen, or am I missing something
which prevents it?

If this can happen, the problem is that the only thing preventing thread A
from freeing the device is that thread B has device->mutex locked. That's
bad, because a lock within a structure cannot be used to control freeing
that structure. The mutex_unlock in thread B may internally still access
the mutex memory even after the atomic operation which unlocks the mutex
and unblocks thread A.

This can't be solved by having the driver wait for the ->release mirror
callback before it calls hmm_device_unregister, because the race happens
after that point.

A kref on the device itself might solve this, but the core issue IMO is
that hmm_mirror_unregister doesn't wait for hmm_notifier_release to
complete before returning. It feels like hmm_mirror_unregister wants to do
a synchronize_srcu on the mmu_notifier srcu. Is that possible?
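
Something like the following, perhaps, where the device kref, the
hypothetical ->free() callback on hmm_device_ops, and
hmm_device_destroy() are all sketched additions rather than anything in
this patch:

	static void hmm_device_destroy(struct kref *kref)
	{
		struct hmm_device *device;

		device = container_of(kref, struct hmm_device, kref);
		/* No thread can still be inside device->mutex here, so
		 * the driver may safely free the device. */
		device->ops->free(device);
	}

	static void hmm_mirror_destroy(struct kref *kref)
	{
		/* ... existing body up to and including ... */
		mutex_unlock(&device->mutex);
		/* New: drop the reference this mirror held on its device,
		 * keeping the device (and its mutex) valid until the last
		 * user is past the unlock. */
		kref_put(&device->kref, hmm_device_destroy);
	}

hmm_device_unregister() would then drop the initial reference taken at
hmm_device_register() time, rather than being the point after which the
caller may free the device.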

Whatever the resolution, it would be useful for the block comments of
hmm_mirror_unregister and hmm_device_unregister to describe the
expectations on the caller and what the caller is guaranteed as far as
mirror and device lifetimes go.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
       [not found]     ` <alpine.DEB.2.00.1506081222270.27796-ptWJzH4JGIzJt4XymMeBgkn48jw8i0AO@public.gmane.org>
@ 2015-06-08 21:17       ` Jerome Glisse
  2015-06-09  1:54         ` Mark Hairgrove
  0 siblings, 1 reply; 22+ messages in thread
From: Jerome Glisse @ 2015-06-08 21:17 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Linus Torvalds,
	joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai

On Mon, Jun 08, 2015 at 12:40:18PM -0700, Mark Hairgrove wrote:
> 
> 
> On Thu, 21 May 2015, j.glisse-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
> 
> > From: Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >
> > This patch only introduce core HMM functions for registering a new
> > mirror and stopping a mirror as well as HMM device registering and
> > unregistering.
> >
> > [...]
> >
> > +/* struct hmm_device_operations - HMM device operation callback
> > + */
> > +struct hmm_device_ops {
> > +	/* release() - mirror must stop using the address space.
> > +	 *
> > +	 * @mirror: The mirror that link process address space with the device.
> > +	 *
> > +	 * When this is call, device driver must kill all device thread using
> > +	 * this mirror. Also, this callback is the last thing call by HMM and
> > +	 * HMM will not access the mirror struct after this call (ie no more
> > +	 * dereference of it so it is safe for the device driver to free it).
> > +	 * It is call either from :
> > +	 *   - mm dying (all process using this mm exiting).
> > +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> > +	 *   - outcome of some device error reported by any of the device
> > +	 *     callback against that mirror.
> > +	 */
> > +	void (*release)(struct hmm_mirror *mirror);
> > +};
> 
> The comment that ->release is called when the mm dies doesn't match the
> implementation. ->release is only called when the mirror is destroyed, and
> that can only happen after the mirror has been unregistered. This may not
> happen until after the mm dies.
> 
> Is the intent for the driver to get the callback when the mm goes down?
> That seems beneficial so the driver can kill whatever's happening on the
> device. Otherwise the device may continue operating in a dead address
> space until the driver's file gets closed and it unregisters the mirror.

This was the intent before merging free & release. I guess I need to
reinstate the free versus release callbacks. Sadly the lifetime for HMM
is more complex than for mmu_notifier, as we intend the mirror struct to
be embedded into a driver private struct.
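
A sketch of what reinstating that split might look like (the exact
semantics of the two callbacks are hypothetical here):

	struct hmm_device_ops {
		/* mm is dying: kill all device threads using the mirror,
		 * but the mirror struct itself must stay valid. */
		void (*release)(struct hmm_mirror *mirror);
		/* last reference is gone: HMM will never touch the mirror
		 * again, the driver may free the embedding struct. */
		void (*free)(struct hmm_mirror *mirror);
	};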

> 
> > +static void hmm_mirror_destroy(struct kref *kref)
> > +{
> > +	struct hmm_device *device;
> > +	struct hmm_mirror *mirror;
> > +	struct hmm *hmm;
> > +
> > +	mirror = container_of(kref, struct hmm_mirror, kref);
> > +	device = mirror->device;
> > +	hmm = mirror->hmm;
> > +
> > +	mutex_lock(&device->mutex);
> > +	list_del_init(&mirror->dlist);
> > +	device->ops->release(mirror);
> > +	mutex_unlock(&device->mutex);
> > +}
> 
> The hmm variable is unused. It also probably isn't safe to access at this
> point.

The hmm_unref(hmm) call was lost somewhere, probably in a rebase, and it
is safe to access hmm here.
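
For reference, a minimal sketch of the destructor with that call
restored (grabbing the hmm pointer before ->release(), since the driver
is allowed to free the mirror from that callback):

	static void hmm_mirror_destroy(struct kref *kref)
	{
		struct hmm_device *device;
		struct hmm_mirror *mirror;
		struct hmm *hmm;

		mirror = container_of(kref, struct hmm_mirror, kref);
		device = mirror->device;
		hmm = mirror->hmm;

		mutex_lock(&device->mutex);
		list_del_init(&mirror->dlist);
		device->ops->release(mirror);	/* mirror may be freed here */
		mutex_unlock(&device->mutex);

		/* Drop the reference taken in hmm_mirror_register(). */
		hmm_unref(hmm);
	}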

> 
> 
> > +static void hmm_mirror_kill(struct hmm_mirror *mirror)
> > +{
> > +	down_write(&mirror->hmm->rwsem);
> > +	if (!hlist_unhashed(&mirror->mlist)) {
> > +		hlist_del_init(&mirror->mlist);
> > +		up_write(&mirror->hmm->rwsem);
> > +
> > +		hmm_mirror_unref(&mirror);
> > +	} else
> > +		up_write(&mirror->hmm->rwsem);
> > +}
> 
> Shouldn't this call hmm_unref? hmm_mirror_register calls hmm_ref but
> there's no corresponding hmm_unref when the mirror goes away. As a result
> the hmm struct gets leaked and thus so does the entire mm since
> mmu_notifier_unregister is never called.
> 
> It might also be a good idea to set mirror->hmm = NULL here to prevent
> accidental use in say hmm_mirror_destroy.

No, hmm_mirror_destroy must be the one doing the hmm_unref(hmm)

> 
> 
> > +/* hmm_device_unregister() - unregister a device with HMM.
> > + *
> > + * @device: The hmm_device struct.
> > + * Returns: 0 on success or -EBUSY otherwise.
> > + *
> > + * Call when device driver want to unregister itself with HMM. This will check
> > + * that there is no any active mirror and returns -EBUSY if so.
> > + */
> > +int hmm_device_unregister(struct hmm_device *device)
> > +{
> > +	mutex_lock(&device->mutex);
> > +	if (!list_empty(&device->mirrors)) {
> > +		mutex_unlock(&device->mutex);
> > +		return -EBUSY;
> > +	}
> > +	mutex_unlock(&device->mutex);
> > +	return 0;
> > +}
> 
> I assume that the intention is for the caller to spin on
> hmm_device_unregister until -EBUSY is no longer returned?
> 
> If so, I think there's a race here in the case of mm teardown happening
> concurrently with hmm_mirror_unregister. This can happen if the parent
> process was forked and exits while the child closes the file, or if the
> file is passed to another process and closed last there while the original
> process exits.
> 
> The upshot is that the hmm_device may still be referenced by another
> thread even after hmm_device_unregister returns 0.
> 
> The below sequence shows how this might happen. Coming into this, the
> mirror's ref count is 2:
> 
> Thread A (file close)               Thread B (process exit)
> ----------------------              ----------------------
>                                     hmm_notifier_release
>                                       down_write(&hmm->rwsem);
> hmm_mirror_unregister
>   hmm_mirror_kill
>     down_write(&hmm->rwsem);
>     // Blocked on thread B
>                                       hlist_del_init(&mirror->mlist);
>                                       up_write(&hmm->rwsem);
> 
>                                       // Thread A unblocked
>                                       // Thread B is preempted
>     // hlist_unhashed returns 1
>     up_write(&hmm->rwsem);
> 
>   // Mirror ref goes 2 -> 1
>   hmm_mirror_unref(&mirror);
> 
>   // hmm_mirror_unregister returns
> 
> At this point hmm_mirror_unregister has returned to the caller but the
> mirror still is in use by thread B. Since all mirrors have been
> unregistered, the driver in thread A is now free to call
> hmm_device_unregister.
> 
>                                       // Thread B is scheduled
> 
>                                       // Mirror ref goes 1 -> 0
>                                       hmm_mirror_unref(&mirror);
>                                         hmm_mirror_destroy(&mirror)
>                                           mutex_lock(&device->mutex);
>                                           list_del_init(&mirror->dlist);
>                                           device->ops->release(mirror);
>                                           mutex_unlock(&device->mutex);
> 
> hmm_device_unregister
>   mutex_lock(&device->mutex);
>   // Device list empty
>   mutex_unlock(&device->mutex);
>   return 0;
> // Caller frees device
> 
> Do you agree that this sequence can happen, or am I missing something
> which prevents it?

Can't happen because the child has mm->hmm = NULL, ie there is only one
hmm per mm and an hmm is tied to only one mm. It is the responsibility
of the device driver to make sure the same applies to its private
references to the hmm_mirror struct, ie an hmm_mirror should never be
tied to a private file struct.
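
For reference, the fork path is what gives the child that clean slate; a
minimal sketch of the hmm_mm_init() called from mm_init(), assuming it
does nothing more than clear the pointer:

	#ifdef CONFIG_HMM
	static inline void hmm_mm_init(struct mm_struct *mm)
	{
		/* A new mm starts unmirrored; each device must register
		 * a mirror against the child explicitly. */
		mm->hmm = NULL;
	}
	#else
	static inline void hmm_mm_init(struct mm_struct *mm)
	{
	}
	#endif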

> 
> If this can happen, the problem is that the only thing preventing thread A
> from freeing the device is that thread B has device->mutex locked. That's
> bad, because a lock within a structure cannot be used to control freeing
> that structure. The mutex_unlock in thread B may internally still access
> the mutex memory even after the atomic operation which unlocks the mutex
> and unblocks thread A.
> 
> This can't be solved by having the driver wait for the ->release mirror
> callback before it calls hmm_device_unregister, because the race happens
> after that point.
> 
> A kref on the device itself might solve this, but the core issue IMO is
> that hmm_mirror_unregister doesn't wait for hmm_notifier_release to
> complete before returning. It feels like hmm_mirror_unregister wants to do
> a synchronize_srcu on the mmu_notifier srcu. Is that possible?

I guess I need to revisit the whole lifetime issue once again.

> 
> Whatever the resolution, it would be useful for the block comments of
> hmm_mirror_unregister and hmm_device_unregister to describe the
> expectations on the caller and what the caller is guaranteed as far as
> mirror and device lifetimes go.

Yes, I need to fix the comments and again spend time on lifetime. I
obviously completely screwed that up in this version of the patchset.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-08 21:17       ` Jerome Glisse
@ 2015-06-09  1:54         ` Mark Hairgrove
       [not found]           ` <alpine.DEB.2.00.1506081841490.1802-ptWJzH4JGIzJt4XymMeBgkn48jw8i0AO@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Mark Hairgrove @ 2015-06-09  1:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Mark Hairgrove, Cameron Buschardt,
	Arvind




On Mon, 8 Jun 2015, Jerome Glisse wrote:

> On Mon, Jun 08, 2015 at 12:40:18PM -0700, Mark Hairgrove wrote:
> >
> >
> > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> >
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > >
> > > This patch only introduce core HMM functions for registering a new
> > > mirror and stopping a mirror as well as HMM device registering and
> > > unregistering.
> > >
> > > [...]
> > >
> > > +/* struct hmm_device_operations - HMM device operation callback
> > > + */
> > > +struct hmm_device_ops {
> > > +	/* release() - mirror must stop using the address space.
> > > +	 *
> > > +	 * @mirror: The mirror that link process address space with the device.
> > > +	 *
> > > +	 * When this is call, device driver must kill all device thread using
> > > +	 * this mirror. Also, this callback is the last thing call by HMM and
> > > +	 * HMM will not access the mirror struct after this call (ie no more
> > > +	 * dereference of it so it is safe for the device driver to free it).
> > > +	 * It is call either from :
> > > +	 *   - mm dying (all process using this mm exiting).
> > > +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> > > +	 *   - outcome of some device error reported by any of the device
> > > +	 *     callback against that mirror.
> > > +	 */
> > > +	void (*release)(struct hmm_mirror *mirror);
> > > +};
> >
> > The comment that ->release is called when the mm dies doesn't match the
> > implementation. ->release is only called when the mirror is destroyed, and
> > that can only happen after the mirror has been unregistered. This may not
> > happen until after the mm dies.
> >
> > Is the intent for the driver to get the callback when the mm goes down?
> > That seems beneficial so the driver can kill whatever's happening on the
> > device. Otherwise the device may continue operating in a dead address
> > space until the driver's file gets closed and it unregisters the mirror.
>
> This was the intent before merging free & release. I guess i need to
> reinstate the free versus release callback. Sadly the lifetime for HMM
> is more complex than mmu_notifier as we intend the mirror struct to
> be embedded into a driver private struct.

Can you clarify how that's different from mmu_notifiers? Those are also
embedded into a driver-owned struct.

Is the goal to allow calling hmm_mirror_unregister from within the "mm is
dying" HMM callback? I don't know whether that's really necessary as long
as there's some association between the driver files and the mirrors.


> > If so, I think there's a race here in the case of mm teardown happening
> > concurrently with hmm_mirror_unregister.
> >
> > [...]
> >
> > Do you agree that this sequence can happen, or am I missing something
> > which prevents it?
>
> Can't happen because child have mm->hmm = NULL ie only one hmm per mm
> and hmm is tie to only one mm. It is the responsability of the device
> driver to make sure same apply to private reference to the hmm mirror
> struct ie hmm_mirror should never be tie to a private file struct.

It's useful for the driver to have some association between files and
mirrors. If the file is closed prior to process exit we would like to
unregister the mirror, otherwise it will persist until process teardown.
The association doesn't have to be 1:1 but having the files ref count the
mirror or something would be useful.

But even if we assume no association at all between files and mirrors, are
you sure that prevents the race? The driver may choose to unregister the
hmm_device at any point once its files are closed. In the case of module
unload the device unregister can't be prevented. If mm teardown hasn't
happened yet mirrors may still be active and registered on that
hmm_device. The driver thus has to first call hmm_mirror_unregister on all
active mirrors, then call hmm_device_unregister. mm teardown of those
mirrors may trigger at any point in this sequence, so we're right back to
that race.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
       [not found]           ` <alpine.DEB.2.00.1506081841490.1802-ptWJzH4JGIzJt4XymMeBgkn48jw8i0AO@public.gmane.org>
@ 2015-06-09 15:56             ` Jerome Glisse
  2015-06-10  3:33               ` Mark Hairgrove
  0 siblings, 1 reply; 22+ messages in thread
From: Jerome Glisse @ 2015-06-09 15:56 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Linus Torvalds,
	joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai

On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote:
> On Mon, 8 Jun 2015, Jerome Glisse wrote:
> > On Mon, Jun 08, 2015 at 12:40:18PM -0700, Mark Hairgrove wrote:
> > > On Thu, 21 May 2015, j.glisse-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
> > > > From: Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > >
> > > > This patch only introduce core HMM functions for registering a new
> > > > mirror and stopping a mirror as well as HMM device registering and
> > > > unregistering.
> > > >
> > > > [...]
> > > >
> > > > +/* struct hmm_device_operations - HMM device operation callback
> > > > + */
> > > > +struct hmm_device_ops {
> > > > +	/* release() - mirror must stop using the address space.
> > > > +	 *
> > > > +	 * @mirror: The mirror that link process address space with the device.
> > > > +	 *
> > > > +	 * When this is call, device driver must kill all device thread using
> > > > +	 * this mirror. Also, this callback is the last thing call by HMM and
> > > > +	 * HMM will not access the mirror struct after this call (ie no more
> > > > +	 * dereference of it so it is safe for the device driver to free it).
> > > > +	 * It is call either from :
> > > > +	 *   - mm dying (all process using this mm exiting).
> > > > +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> > > > +	 *   - outcome of some device error reported by any of the device
> > > > +	 *     callback against that mirror.
> > > > +	 */
> > > > +	void (*release)(struct hmm_mirror *mirror);
> > > > +};
> > >
> > > The comment that ->release is called when the mm dies doesn't match the
> > > implementation. ->release is only called when the mirror is destroyed, and
> > > that can only happen after the mirror has been unregistered. This may not
> > > happen until after the mm dies.
> > >
> > > Is the intent for the driver to get the callback when the mm goes down?
> > > That seems beneficial so the driver can kill whatever's happening on the
> > > device. Otherwise the device may continue operating in a dead address
> > > space until the driver's file gets closed and it unregisters the mirror.
> >
> > This was the intent before merging free & release. I guess i need to
> > reinstate the free versus release callback. Sadly the lifetime for HMM
> > is more complex than mmu_notifier as we intend the mirror struct to
> > be embedded into a driver private struct.
> 
> Can you clarify how that's different from mmu_notifiers? Those are also
> embedded into a driver-owned struct.

For HMM you want to be able to kill a mirror from HMM, and you might have
kernel threads calling into HMM with a mirror (outside any device file
lifetime). The mirror is not only used at register & unregister; there are
a lot more things you can call using the HMM mirror struct.

So the HMM mirror lifetime is more complex as a result; it cannot simply
be freed from the mmu_notifier_release callback or at some random point.
It needs to be refcounted. The mmu_notifier code assumes that the
mmu_notifier struct is embedded inside a struct whose lifetime is properly
synchronized with the mm. For the HMM mirror this does not sound like a
good idea, as there are too many ways to get it wrong.

So the idea of the HMM mirror is that it can outlast the mm lifetime but
the HMM struct cannot. So you have hmm_mirror <~> hmm <-> mm, and the
mirror can be "unlinked" and have a different lifetime from the hmm, which
itself has the same lifetime as the mm.

> Is the goal to allow calling hmm_mirror_unregister from within the "mm is
> dying" HMM callback? I don't know whether that's really necessary as long
> as there's some association between the driver files and the mirrors.

No, this is not a goal, and I actually forbid that.

> 
> > > If so, I think there's a race here in the case of mm teardown happening
> > > concurrently with hmm_mirror_unregister.
> > >
> > > [...]
> > >
> > > Do you agree that this sequence can happen, or am I missing something
> > > which prevents it?
> >
> > Can't happen because child have mm->hmm = NULL ie only one hmm per mm
> > and hmm is tie to only one mm. It is the responsability of the device
> > driver to make sure same apply to private reference to the hmm mirror
> > struct ie hmm_mirror should never be tie to a private file struct.
> 
> It's useful for the driver to have some association between files and
> mirrors. If the file is closed prior to process exit we would like to
> unregister the mirror, otherwise it will persist until process teardown.
> The association doesn't have to be 1:1 but having the files ref count the
> mirror or something would be useful.

This is allowed; I might have put it in strong words here, but you can
associate a mirror with a file struct. What you cannot do is use the
mirror from a different process, ie one with a different mm struct, as a
mirror is linked to a single mm. So on fork there is no callback to update
the private file struct when the device file is duplicated (well, just
refcount incremented) against a different process. This is something you
need to be careful about in your driver. Inside the dummy driver I abuse
that to actually test proper behavior of HMM, but it should not be used
as an example.

> 
> But even if we assume no association at all between files and mirrors, are
> you sure that prevents the race? The driver may choose to unregister the
> hmm_device at any point once its files are closed. In the case of module
> unload the device unregister can't be prevented. If mm teardown hasn't
> happened yet mirrors may still be active and registered on that
> hmm_device. The driver thus has to first call hmm_mirror_unregister on all
> active mirrors, then call hmm_device_unregister. mm teardown of those
> mirrors may trigger at any point in this sequence, so we're right back to
> that race.

So when the device driver unloads, the first thing it needs to do is kill
all of its contexts, ie all of its HMM mirrors (unregister them); by doing
so it will make sure that there can be no more calls to any of its
functions.
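
A minimal sketch of that ordering (all foo_* names are hypothetical, and
it assumes hmm_device_unregister() keeps returning -EBUSY while any
mirror remains):

	static void foo_driver_unload(struct foo_device *fdev)
	{
		struct foo_context *ctx, *tmp;

		/* Kill every context first; once the last mirror is
		 * unregistered and freed, HMM makes no more calls into
		 * this driver. */
		list_for_each_entry_safe(ctx, tmp, &fdev->contexts, list)
			hmm_mirror_unregister(&ctx->mirror);

		while (hmm_device_unregister(&fdev->hmm_device) == -EBUSY)
			cond_resched();
	}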

The race with mm teardown does not exist, as what matters for mm teardown
is whether the mirror is on the struct hmm mirrors list or not. Either the
device driver is first to remove the mirror from the list or the mm
teardown is, but this is lock protected so only one thread can do it.

The issue you pointed out is really about decoupling the lifetime of the
mirror context (ie the hardware threads that use the mirror) from the
lifetime of the structure that embeds the hmm_mirror struct. The device
driver will care about the second while everything else will only really
care about the first. The second tells you when you know for sure that
there will be no more callbacks into your device driver code. The first
only tells you that there should be no more activity associated with that
mirror, but some thread might still hold a reference on the underlying
struct.


Hope this clarifies the design and motivation behind the hmm_mirror vs
hmm struct lifetimes.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-09 15:56             ` Jerome Glisse
@ 2015-06-10  3:33               ` Mark Hairgrove
  2015-06-10 15:42                 ` Jerome Glisse
  0 siblings, 1 reply; 22+ messages in thread
From: Mark Hairgrove @ 2015-06-10  3:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran <hag>




On Tue, 9 Jun 2015, Jerome Glisse wrote:

> On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote:
> > Can you clarify how that's different from mmu_notifiers? Those are also
> > embedded into a driver-owned struct.
> 
> For HMM you want to be able to kill a mirror from HMM, you might have kernel
> thread call inside HMM with a mirror (outside any device file lifetime) ...
> The mirror is not only use at register & unregister, there is a lot more thing
> you can call using the HMM mirror struct.
> 
> So the HMM mirror lifetime as a result is more complex, it can not simply be
> free from the mmu_notifier_release callback or randomly. It needs to be
> refcounted.

Sure, there are driver -> HMM calls like hmm_mirror_fault that 
mmu_notifiers don't have, but I don't understand why that fundamentally 
makes HMM mirror lifetimes more complex. Decoupling hmm_mirror lifetime 
from mm lifetime adds complexity too.

> The mmu_notifier code assume that the mmu_notifier struct is
> embedded inside a struct that has a lifetime properly synchronize with the
> mm. For HMM mirror this is not something that sounds like a good idea as there
> is too many way to get it wrong.

What kind of synchronization with the mm are you referring to here? 
Clients of mmu_notifiers don't have to do anything as far as I know. 
They're guaranteed that the mm won't go away because each registered 
notifier bumps mm_count.
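
That pin comes from the notifier core itself; abridged from
mm/mmu_notifier.c as I remember it, so treat the exact lines as
approximate:

	/* do_mmu_notifier_register(): every registered notifier pins
	 * the mm, dropped again via mmdrop() at unregister time. */
	atomic_inc(&mm->mm_count);
	...
	hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);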

> So idea of HMM mirror is that it can out last the mm lifetime but the HMM
> struct can not. So you have hmm_mirror <~> hmm <-> mm and the mirror can be
> "unlink" and have different lifetime from the hmm that itself has same life
> time as mm.

Per the earlier discussion hmm_mirror_destroy is missing a call to 
hmm_unref. If that's added back I don't understand how the mirror can 
persist past the hmm struct. The mirror can be unlinked from hmm's list, 
yes, but that doesn't mean that hmm/mm can be torn down. The hmm/mm 
structs will stick around until hmm_destroy since that does the 
mmu_notifier_unregister. hmm_destroy can't be called until the last 
hmm_mirror_destroy.

Doesn't that mean that hmm/mm are guaranteed to be allocated until the 
last hmm_mirror_unregister? That sounds like a good guarantee to make.


> 
> > Is the goal to allow calling hmm_mirror_unregister from within the "mm is
> > dying" HMM callback? I don't know whether that's really necessary as long
> > as there's some association between the driver files and the mirrors.
> 
> No this is not a goal and i actualy forbid that.
> 
> > 
> > > > If so, I think there's a race here in the case of mm teardown happening
> > > > concurrently with hmm_mirror_unregister.
> > > >
> > > > [...]
> > > >
> > > > Do you agree that this sequence can happen, or am I missing something
> > > > which prevents it?
> > >
> > > Can't happen because child have mm->hmm = NULL ie only one hmm per mm
> > > and hmm is tie to only one mm. It is the responsability of the device
> > > driver to make sure same apply to private reference to the hmm mirror
> > > struct ie hmm_mirror should never be tie to a private file struct.
> > 
> > It's useful for the driver to have some association between files and
> > mirrors. If the file is closed prior to process exit we would like to
> > unregister the mirror, otherwise it will persist until process teardown.
> > The association doesn't have to be 1:1 but having the files ref count the
> > mirror or something would be useful.
> 
> This is allowed, i might have put strong word here, but you can associate
> with a file struct. What you can not do is use the mirror from a different
> process ie one with a different mm struct as mirror is linked to a single
> mm. So on fork there is no callback to update the private file struct, when
> the device file is duplicated (well just refcount inc) against a different
> process. This is something you need to be carefull in your driver. Inside
> the dummy driver i abuse that to actually test proper behavior of HMM but
> it should not be use as an example.

So to confirm, on all file operations from user space the driver is 
expected to check that current->mm matches the mm associated with the 
struct file's hmm_mirror?

On file->release the driver still ought to call hmm_mirror_unregister 
regardless of whether the mms match, otherwise we'll never tear down the 
mirror. That means we're not saved from the race condition because 
hmm_mirror_unregister can happen in one thread while hmm_notifier_release 
might be happening in another thread.


> > 
> > But even if we assume no association at all between files and mirrors, are
> > you sure that prevents the race? The driver may choose to unregister the
> > hmm_device at any point once its files are closed. In the case of module
> > unload the device unregister can't be prevented. If mm teardown hasn't
> > happened yet mirrors may still be active and registered on that
> > hmm_device. The driver thus has to first call hmm_mirror_unregister on all
> > active mirrors, then call hmm_device_unregister. mm teardown of those
> > mirrors may trigger at any point in this sequence, so we're right back to
> > that race.
> 
> So when device driver unload the first thing it needs to do is kill all of
> its context ie all of its HMM mirror (unregister them) by doing so it will
> make sure that there can be no more call to any of its functions.

When is the driver expected to call hmm_mirror_unregister? Is it file 
close, module unload, or some other time?

If it's file close, there's no need to unregister anything on module 
unload because the files were all closed already.

If it's module unload, then the mirrors and mms all get leaked until that 
point.

We're exposed to the race in both cases.

> 
> The race with mm teardown does not exist as what matter for mm teardown is
> the fact that the mirror is on the struct hmm mirrors list or not. Either
> the device driver is first to remove the mirror from the list or it is the
> mm teardown but this is lock protected so only one thread can do it.
> 

Agreed, removing the mirror from the list is not a "race" in the classical 
sense. The true race is between hmm_notifier_release's device mutex_unlock 
(process exit) and post-hmm_device_unregister device mutex free (driver 
close/unload). What I meant is that in order to expose that race you first 
need one thread to call hmm_mirror_unregister while another thread is in 
hmm_notifier_release.

Regardless of where hmm_mirror_unregister is called (file close, module 
unload, etc) it can happen concurrently with hmm_notifier_release so we're 
exposed to this race.


> The issue you pointed is really about decoupling the lifetime of the mirror
> context (ie hardware thread that use the mirror) and the lifetime of the
> structure that embedded the hmm_mirror struct. The device driver will care
> about the second while everything else will only really care about the
> first. The second tells you when you know for sure that there will be no
> more callback to your device driver code. The first only tells you that
> there should be no more activity associated with that mirror but some thread
> might still hold a reference on the underlying struct.
> 
> 
> Hope this clarify design and motivation behind the hmm_mirror vs hmm struct
> lifetime.
> 
> 
> Cheers,
> Jérôme
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-10  3:33               ` Mark Hairgrove
@ 2015-06-10 15:42                 ` Jerome Glisse
  2015-06-11  1:15                   ` Mark Hairgrove
  0 siblings, 1 reply; 22+ messages in thread
From: Jerome Glisse @ 2015-06-10 15:42 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai

On Tue, Jun 09, 2015 at 08:33:12PM -0700, Mark Hairgrove wrote:
> 
> 
> On Tue, 9 Jun 2015, Jerome Glisse wrote:
> 
> > On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote:
> > > Can you clarify how that's different from mmu_notifiers? Those are also
> > > embedded into a driver-owned struct.
> > 
> > For HMM you want to be able to kill a mirror from HMM, you might have kernel
> > thread call inside HMM with a mirror (outside any device file lifetime) ...
> > The mirror is not only use at register & unregister, there is a lot more thing
> > you can call using the HMM mirror struct.
> > 
> > So the HMM mirror lifetime as a result is more complex, it can not simply be
> > free from the mmu_notifier_release callback or randomly. It needs to be
> > refcounted.
> 
> Sure, there are driver -> HMM calls like hmm_mirror_fault that 
> mmu_notifiers don't have, but I don't understand why that fundamentally 
> makes HMM mirror lifetimes more complex. Decoupling hmm_mirror lifetime 
> from mm lifetime adds complexity too.

Driver->HMM calls can happen from a random kernel thread, thus you need
to guarantee that the hmm_mirror cannot go away. Moreover you can have the
CPU MM call into HMM outside of mmu_notifier. Basically you can get to HMM
code by many different code paths, unlike any of the current mmu_notifier
users.

So refcounting is necessary, as otherwise the device driver might decide
to unregister and free the mirror while some other kernel thread is about
to dereference that exact same mirror. Synchronization with the mmu
notifier srcu will not be enough in the case of a page fault on remote
memory, for instance. There are other cases too.
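
To illustrate, a sketch of the kind of path that needs this (foo_* names
are hypothetical; hmm_mirror_ref()/hmm_mirror_unref() are the mirror
refcount helpers, and actually servicing the fault would go through the
fault API added later in the series):

	static void foo_fault_work(struct work_struct *work)
	{
		struct foo_fault *f = container_of(work, struct foo_fault,
						   work);

		/* Without this reference, a concurrent
		 * hmm_mirror_unregister() could free the mirror while the
		 * worker still dereferences it. */
		hmm_mirror_ref(f->mirror);
		foo_service_fault(f);
		hmm_mirror_unref(&f->mirror);
	}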

> 
> > The mmu_notifier code assume that the mmu_notifier struct is
> > embedded inside a struct that has a lifetime properly synchronize with the
> > mm. For HMM mirror this is not something that sounds like a good idea as there
> > is too many way to get it wrong.
> 
> What kind of synchronization with the mm are you referring to here? 
> Clients of mmu_notifiers don't have to do anything as far as I know. 
> They're guaranteed that the mm won't go away because each registered 
> notifier bumps mm_count.

So all current users afaict are tied to a file (kvm, xen, intel, radeon),
to a vma (sgi gru), or to the mm (iommu). Which means this is all properly
synchronized with the lifetime of the mm (ignoring the fork case).

The struct that is tied to the mmu_notifier can, for all of them, be
accessed only through one code path (ioctl for most of them).

> 
> > So idea of HMM mirror is that it can out last the mm lifetime but the HMM
> > struct can not. So you have hmm_mirror <~> hmm <-> mm and the mirror can be
> > "unlink" and have different lifetime from the hmm that itself has same life
> > time as mm.
> 
> Per the earlier discussion hmm_mirror_destroy is missing a call to 
> hmm_unref. If that's added back I don't understand how the mirror can 
> persist past the hmm struct. The mirror can be unlinked from hmm's list, 
> yes, but that doesn't mean that hmm/mm can be torn down. The hmm/mm 
> structs will stick around until hmm_destroy since that does the 
> mmu_notifier_unregister. hmm_destroy can't be called until the last 
> hmm_mirror_destroy.
> 
> Doesn't that mean that hmm/mm are guaranteed to be allocated until the 
> last hmm_mirror_unregister? That sounds like a good guarantee to make.

Like I said, just ignore the current code, it is utterly broken in so
many ways when it comes to lifetime. I screwed that part up badly when
reworking the patchset; I was focusing on other parts.

I fixed that in my tree; I am waiting for more review on the other parts,
as the lifetime thing is easy to rework/fix anyway.

http://cgit.freedesktop.org/~glisse/linux/log/?h=hmm

[...]
> > > > > If so, I think there's a race here in the case of mm teardown happening
> > > > > concurrently with hmm_mirror_unregister.
> > > > >
> > > > > [...]
> > > > >
> > > > > Do you agree that this sequence can happen, or am I missing something
> > > > > which prevents it?
> > > >
> > > > Can't happen because child have mm->hmm = NULL ie only one hmm per mm
> > > > and hmm is tie to only one mm. It is the responsability of the device
> > > > driver to make sure same apply to private reference to the hmm mirror
> > > > struct ie hmm_mirror should never be tie to a private file struct.
> > > 
> > > It's useful for the driver to have some association between files and
> > > mirrors. If the file is closed prior to process exit we would like to
> > > unregister the mirror, otherwise it will persist until process teardown.
> > > The association doesn't have to be 1:1 but having the files ref count the
> > > mirror or something would be useful.
> > 
> > This is allowed, i might have put strong word here, but you can associate
> > with a file struct. What you can not do is use the mirror from a different
> > process ie one with a different mm struct as mirror is linked to a single
> > mm. So on fork there is no callback to update the private file struct, when
> > the device file is duplicated (well just refcount inc) against a different
> > process. This is something you need to be carefull in your driver. Inside
> > the dummy driver i abuse that to actually test proper behavior of HMM but
> > it should not be use as an example.
> 
> So to confirm, on all file operations from user space the driver is 
> expected to check that current->mm matches the mm associated with the 
> struct file's hmm_mirror?

Well, you might have a valid usecase for that, just be aware that
anything your driver does with the hmm_mirror will actually impact the mm
of the parent. Which I assume is not what you want.

I would actually have thought that what you want is a way to find the
hmm_mirror using both the device file & the mm as a key. Otherwise you
cannot really use HMM with processes that like to fork themselves, which
is a valid usecase to me. For instance, a process starts using HMM
through your driver, then decides to fork itself and to also use HMM
through your driver inside its child.
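
A sketch of that lookup (foo_* names are hypothetical; the mirror's mm is
reachable as mirror->hmm->mm, which is the one access to struct hmm the
driver is allowed):

	static struct hmm_mirror *foo_find_mirror(struct foo_file *ffile)
	{
		struct foo_mirror *fm;

		mutex_lock(&ffile->lock);
		list_for_each_entry(fm, &ffile->mirrors, list) {
			if (fm->mirror.hmm->mm == current->mm) {
				hmm_mirror_ref(&fm->mirror);
				mutex_unlock(&ffile->lock);
				return &fm->mirror;
			}
		}
		mutex_unlock(&ffile->lock);
		/* No mirror for current->mm yet: register a new one. */
		return NULL;
	}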

> 
> On file->release the driver still ought to call hmm_mirror_unregister 
> regardless of whether the mms match, otherwise we'll never tear down the 
> mirror. That means we're not saved from the race condition because 
> hmm_mirror_unregister can happen in one thread while hmm_notifier_release 
> might be happening in another thread.

Again there is no race: the mirror list is the synchronization point and
it is protected by a lock. So either hmm_mirror_unregister() wins or the
other thread in hmm_notifier_release() does.

> > > But even if we assume no association at all between files and mirrors, are
> > > you sure that prevents the race? The driver may choose to unregister the
> > > hmm_device at any point once its files are closed. In the case of module
> > > unload the device unregister can't be prevented. If mm teardown hasn't
> > > happened yet mirrors may still be active and registered on that
> > > hmm_device. The driver thus has to first call hmm_mirror_unregister on all
> > > active mirrors, then call hmm_device_unregister. mm teardown of those
> > > mirrors may trigger at any point in this sequence, so we're right back to
> > > that race.
> > 
> > So when device driver unload the first thing it needs to do is kill all of
> > its context ie all of its HMM mirror (unregister them) by doing so it will
> > make sure that there can be no more call to any of its functions.
> 
> When is the driver expected to call hmm_mirror_unregister? Is it file 
> close, module unload, or some other time?
> 
> If it's file close, there's no need to unregister anything on module 
> unload because the files were all closed already.
> 
> If it's module unload, then the mirrors and mms all get leaked until that 
> point.
> 
> We're exposed to the race in both cases.

You unregister as soon as you want; it is up to your driver to do it, I
do not enforce anything. The only thing I enforce is that you cannot
unregister the hmm device before all mirrors are unregistered and freed.

So yes, for the device driver you want to unregister when the device file
is closed (which happens when the process exits).

In both cases there is no race, as explained above.

> 
> > 
> > The race with mm teardown does not exist as what matter for mm teardown is
> > the fact that the mirror is on the struct hmm mirrors list or not. Either
> > the device driver is first to remove the mirror from the list or it is the
> > mm teardown but this is lock protected so only one thread can do it.
> > 
> 
> Agreed, removing the mirror from the list is not a "race" in the classical 
> sense. The true race is between hmm_notifier_release's device mutex_unlock 
> (process exit) and post-hmm_device_unregister device mutex free (driver 
> close/unload). What I meant is that in order to expose that race you first 
> need one thread to call hmm_mirror_unregister while another thread is in 
> hmm_notifier_release.
> 
> Regardless of where hmm_mirror_unregister is called (file close, module 
> unload, etc) it can happen concurrently with hmm_notifier_release so we're 
> exposed to this race.

There is no race here; the mirror struct will only be freed once, as
again the list is the synchronization point. Whoever removes the mirror
from the list is responsible for dropping the list reference.

In the fixed code the only thing that can happen twice is the ->release()
callback. Even that can be worked around to guarantee it is called only
once.

Anyway, I do not see any race here.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-10 15:42                 ` Jerome Glisse
@ 2015-06-11  1:15                   ` Mark Hairgrove
  2015-06-11 14:23                     ` Jerome Glisse
  0 siblings, 1 reply; 22+ messages in thread
From: Mark Hairgrove @ 2015-06-11  1:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran <hag>



On Wed, 10 Jun 2015, Jerome Glisse wrote:

> [...]
> 
> Like said, just ignore current code it is utterly broken in so many way
> when it comes to lifetime. I screw that part badly when reworking the
> patchset, i was focusing on other part.
> 
> I fixed that in my tree, i am waiting for more review on other part as
> anyway the lifetime thing is easy to rework/fix.
> 
> http://cgit.freedesktop.org/~glisse/linux/log/?h=hmm
> 

Ok, I'm working through the other patches so I'll check the updates out 
once I've made it through. My primary interest in this discussion is 
making sure we know the plan for mirror and device lifetimes.


> > So to confirm, on all file operations from user space the driver is 
> > expected to check that current->mm matches the mm associated with the 
> > struct file's hmm_mirror?
> 
> Well you might have a valid usecase for that, just be aware that
> anything your driver do with the hmm_mirror will actually impact
> the mm of the parent. Which i assume is not what you want.
> 
> I would actualy thought that what you want is having a way to find
> hmm_mirror using both device file & mm as a key. Otherwise you can
> not really use HMM with process that like to fork themself. Which
> is a valid usecase to me. For instance process start using HMM
> through your driver, decide to fork itself and to also use HMM
> through your driver inside its child.

Agreed, that sounds reasonable, and the use case is valid. I was digging 
into this to make sure we don't prevent that.


> > 
> > On file->release the driver still ought to call hmm_mirror_unregister 
> > regardless of whether the mms match, otherwise we'll never tear down the 
> > mirror. That means we're not saved from the race condition because 
> > hmm_mirror_unregister can happen in one thread while hmm_notifier_release 
> > might be happening in another thread.
> 
> Again there is no race the mirror list is the synchronization point and
> it is protected by a lock. So either hmm_mirror_unregister() wins or the
> other thread hmm_notifier_release()

Yes, I agree. That's not the race I'm worried about. I'm worried about a 
race on the device lifetime, but in order to hit that one first 
hmm_notifier_release must take the lock and remove the mirror from the 
list before hmm_mirror_unregister does it. That's why I brought it up.


> 
> You unregister as soon as you want, it is up to your driver to do it,
> i do not enforce anything. The only thing i enforce is that you can
> not unregister the hmm device driver before all mirror are unregistered
> and free.
> 
> So yes for device driver you want to unregister when device file is
> close (which happens when the process exit).

Sounds good.


> 
> There is no race here; the mirror struct will only be freed once, as
> again the list is a synchronization point. Whoever removes the mirror
> from the list is responsible for dropping the list reference.
> 
> In the fixed code the only thing that will happen twice is the
> ->release() callback. Even that can be worked around to guarantee it
> is called only once.
> 
> Anyway, I do not see any race here.
> 

The mirror lifetime is fine. The problem I see is with the device lifetime 
on a multi-core system. Imagine this sequence:

- On CPU1 the mm associated with the mirror is going down
- On CPU2 the driver unregisters the mirror then the device

When this happens, the last device mutex_unlock on CPU1 is the only thing 
preventing the free of the device in CPU2. That doesn't work, as described 
in this thread: https://lkml.org/lkml/2013/12/2/997

Here's the full sequence again with mutex_unlock split apart. Hopefully 
this shows the device_unregister problem more clearly:

CPU1 (mm release)                   CPU2 (driver)
----------------------              ----------------------
hmm_notifier_release
  down_write(&hmm->rwsem);
  hlist_del_init(&mirror->mlist);
  up_write(&hmm->rwsem);

  // CPU1 thread is preempted or 
  // something
                                    hmm_mirror_unregister
                                      hmm_mirror_kill
                                        down_write(&hmm->rwsem);
                                        // mirror removed by CPU1 already
                                        // so hlist_unhashed returns 1
                                        up_write(&hmm->rwsem);

                                      hmm_mirror_unref(&mirror);
                                      // Mirror ref now 1

                                      // CPU2 thread is preempted or
                                      // something
// CPU1 thread is scheduled

hmm_mirror_unref(&mirror);
  // Mirror ref now 0, cleanup
  hmm_mirror_destroy(&mirror)
    mutex_lock(&device->mutex);
    list_del_init(&mirror->dlist);
    device->ops->release(mirror);
      kfree(mirror);
                                      // CPU2 thread is scheduled, now
                                      // both CPU1 and CPU2 are running

                                    hmm_device_unregister
                                      mutex_lock(&device->mutex);
                                        mutex_optimistic_spin()
    mutex_unlock(&device->mutex);
      [...]
      __mutex_unlock_common_slowpath
        // CPU2 releases lock
        atomic_set(&lock->count, 1);
                                          // Spinning CPU2 acquires now-
                                          // free lock
                                      // mutex_lock returns
                                      // Device list empty
                                      mutex_unlock(&device->mutex);
                                      return 0;
                                    kfree(hmm_device);
        // CPU1 still accessing 
        // hmm_device->mutex in 
        //__mutex_unlock_common_slowpath
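
To spell the unsafe pattern out in miniature (a simplified sketch of
the pattern above, not code from the patchset):

struct obj {
	struct mutex lock;
	struct list_head mirrors;
};

/* Both CPUs hold a pointer to the same shared object: */
struct obj *obj;

/* CPU1: dropping the last mirror */
mutex_lock(&obj->lock);
list_del_init(&mirror->dlist);
mutex_unlock(&obj->lock);	/* the unlock slowpath can still touch
				 * obj->lock after the mutex is marked
				 * free and visible to other CPUs */

/* CPU2: tearing down the device */
mutex_lock(&obj->lock);		/* succeeds as soon as CPU1 marks the
				 * mutex free */
/* observes the mirrors list empty, proceeds */
mutex_unlock(&obj->lock);
kfree(obj);			/* CPU1 may still be inside
				 * mutex_unlock() => use-after-free */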



* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-11  1:15                   ` Mark Hairgrove
@ 2015-06-11 14:23                     ` Jerome Glisse
  2015-06-11 22:26                       ` Mark Hairgrove
  0 siblings, 1 reply; 22+ messages in thread
From: Jerome Glisse @ 2015-06-11 14:23 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai

On Wed, Jun 10, 2015 at 06:15:08PM -0700, Mark Hairgrove wrote:

[...]
> > There is no race here; the mirror struct will only be freed once, as
> > again the list is a synchronization point. Whoever removes the mirror
> > from the list is responsible for dropping the list reference.
> > 
> > In the fixed code the only thing that will happen twice is the
> > ->release() callback. Even that can be worked around to guarantee it
> > is called only once.
> > 
> > Anyway, I do not see any race here.
> > 
> 
> The mirror lifetime is fine. The problem I see is with the device lifetime 
> on a multi-core system. Imagine this sequence:
> 
> - On CPU1 the mm associated with the mirror is going down
> - On CPU2 the driver unregisters the mirror then the device
> 
> When this happens, the last device mutex_unlock on CPU1 is the only thing 
> preventing the free of the device in CPU2. That doesn't work, as described 
> in this thread: https://lkml.org/lkml/2013/12/2/997
> 
> Here's the full sequence again with mutex_unlock split apart. Hopefully 
> this shows the device_unregister problem more clearly:
> 
> CPU1 (mm release)                   CPU2 (driver)
> ----------------------              ----------------------
> hmm_notifier_release
>   down_write(&hmm->rwsem);
>   hlist_del_init(&mirror->mlist);
>   up_write(&hmm->rwsem);
> 
>   // CPU1 thread is preempted or 
>   // something
>                                     hmm_mirror_unregister
>                                       hmm_mirror_kill
>                                         down_write(&hmm->rwsem);
>                                         // mirror removed by CPU1 already
>                                         // so hlist_unhashed returns 1
>                                         up_write(&hmm->rwsem);
> 
>                                       hmm_mirror_unref(&mirror);
>                                       // Mirror ref now 1
> 
>                                       // CPU2 thread is preempted or
>                                       // something
> // CPU1 thread is scheduled
> 
> hmm_mirror_unref(&mirror);
>   // Mirror ref now 0, cleanup
>   hmm_mirror_destroy(&mirror)
>     mutex_lock(&device->mutex);
>     list_del_init(&mirror->dlist);
>     device->ops->release(mirror);
>       kfree(mirror);
>                                       // CPU2 thread is scheduled, now
>                                       // both CPU1 and CPU2 are running
> 
>                                     hmm_device_unregister
>                                       mutex_lock(&device->mutex);
>                                         mutex_optimistic_spin()
>     mutex_unlock(&device->mutex);
>       [...]
>       __mutex_unlock_common_slowpath
>         // CPU2 releases lock
>         atomic_set(&lock->count, 1);
>                                           // Spinning CPU2 acquires now-
>                                           // free lock
>                                       // mutex_lock returns
>                                       // Device list empty
>                                       mutex_unlock(&device->mutex);
>                                       return 0;
>                                     kfree(hmm_device);
>         // CPU1 still accessing 
>         // hmm_device->mutex in 
>         //__mutex_unlock_common_slowpath

Ok, I see the race you are afraid of, and really it is an unlikely one.
__mutex_unlock_common_slowpath() takes a spinlock right after allowing
others to take the mutex. In your scenario there is no contention on
that spinlock, so it is taken right away, and as there is no one on the
mutex wait list it goes directly to unlocking the spinlock and
returning. You can ignore the debug functions: if debugging is enabled
then mutex_lock() would also need to take the spinlock, and thus you
would have proper synchronization between the two threads thanks to
mutex.wait_lock.

So basically, while CPU1 is doing:

spin_lock(mutex.wait_lock)
if (!list_empty(mutex.wait_list)) {
  // wait_list is empty so branch not taken
}
spin_unlock(mutex.wait_lock)

CPU2 would have to test the mirror list, mutex_unlock(), and return
before the spin_unlock() of CPU1. This is a tight race. I can add a
synchronize_rcu() to device_unregister after the mutex_unlock() so that
we also add a grace period before the device is potentially freed,
which should make that race even more unlikely.

Moreover, for something really bad to happen, the freed memory would
have to be reallocated right away by some other thread, which really
sounds unlikely unless CPU1 is the slowest of all :)
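
Roughly, the mitigation would look like this sketch (the function and
field names are assumptions based on the discussion above, not
necessarily what is in the tree):

int hmm_device_unregister(struct hmm_device *device)
{
	mutex_lock(&device->mutex);
	/* All mirrors must already be unregistered and freed. */
	BUG_ON(!list_empty(&device->mirrors));
	mutex_unlock(&device->mutex);

	/*
	 * Grace period before the caller may free the device, giving a
	 * thread still inside mutex_unlock(&device->mutex) on another
	 * CPU time to leave it. This narrows the window; it does not
	 * provably close it.
	 */
	synchronize_rcu();
	return 0;
}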

Cheers,
Jérôme



* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-11 14:23                     ` Jerome Glisse
@ 2015-06-11 22:26                       ` Mark Hairgrove
  2015-06-15 14:32                         ` Jerome Glisse
  0 siblings, 1 reply; 22+ messages in thread
From: Mark Hairgrove @ 2015-06-11 22:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran <hag>


On Thu, 11 Jun 2015, Jerome Glisse wrote:

> On Wed, Jun 10, 2015 at 06:15:08PM -0700, Mark Hairgrove wrote:
> 
> [...]
> > > There is no race here; the mirror struct will only be freed once,
> > > as again the list is a synchronization point. Whoever removes the
> > > mirror from the list is responsible for dropping the list
> > > reference.
> > > 
> > > In the fixed code the only thing that will happen twice is the
> > > ->release() callback. Even that can be worked around to guarantee
> > > it is called only once.
> > > 
> > > Anyway, I do not see any race here.
> > > 
> > 
> > The mirror lifetime is fine. The problem I see is with the device lifetime 
> > on a multi-core system. Imagine this sequence:
> > 
> > - On CPU1 the mm associated with the mirror is going down
> > - On CPU2 the driver unregisters the mirror then the device
> > 
> > When this happens, the last device mutex_unlock on CPU1 is the only thing 
> > preventing the free of the device in CPU2. That doesn't work, as described 
> > in this thread: https://lkml.org/lkml/2013/12/2/997
> > 
> > Here's the full sequence again with mutex_unlock split apart. Hopefully 
> > this shows the device_unregister problem more clearly:
> > 
> > CPU1 (mm release)                   CPU2 (driver)
> > ----------------------              ----------------------
> > hmm_notifier_release
> >   down_write(&hmm->rwsem);
> >   hlist_del_init(&mirror->mlist);
> >   up_write(&hmm->rwsem);
> > 
> >   // CPU1 thread is preempted or 
> >   // something
> >                                     hmm_mirror_unregister
> >                                       hmm_mirror_kill
> >                                         down_write(&hmm->rwsem);
> >                                         // mirror removed by CPU1 already
> >                                         // so hlist_unhashed returns 1
> >                                         up_write(&hmm->rwsem);
> > 
> >                                       hmm_mirror_unref(&mirror);
> >                                       // Mirror ref now 1
> > 
> >                                       // CPU2 thread is preempted or
> >                                       // something
> > // CPU1 thread is scheduled
> > 
> > hmm_mirror_unref(&mirror);
> >   // Mirror ref now 0, cleanup
> >   hmm_mirror_destroy(&mirror)
> >     mutex_lock(&device->mutex);
> >     list_del_init(&mirror->dlist);
> >     device->ops->release(mirror);
> >       kfree(mirror);
> >                                       // CPU2 thread is scheduled, now
> >                                       // both CPU1 and CPU2 are running
> > 
> >                                     hmm_device_unregister
> >                                       mutex_lock(&device->mutex);
> >                                         mutex_optimistic_spin()
> >     mutex_unlock(&device->mutex);
> >       [...]
> >       __mutex_unlock_common_slowpath
> >         // CPU2 releases lock
> >         atomic_set(&lock->count, 1);
> >                                           // Spinning CPU2 acquires now-
> >                                           // free lock
> >                                       // mutex_lock returns
> >                                       // Device list empty
> >                                       mutex_unlock(&device->mutex);
> >                                       return 0;
> >                                     kfree(hmm_device);
> >         // CPU1 still accessing 
> >         // hmm_device->mutex in 
> >         //__mutex_unlock_common_slowpath
> 
> Ok, I see the race you are afraid of, and really it is an unlikely one.
> __mutex_unlock_common_slowpath() takes a spinlock right after allowing
> others to take the mutex. In your scenario there is no contention on
> that spinlock, so it is taken right away, and as there is no one on the
> mutex wait list it goes directly to unlocking the spinlock and
> returning. You can ignore the debug functions: if debugging is enabled
> then mutex_lock() would also need to take the spinlock, and thus you
> would have proper synchronization between the two threads thanks to
> mutex.wait_lock.
> 
> So basically, while CPU1 is doing:
> 
> spin_lock(mutex.wait_lock)
> if (!list_empty(mutex.wait_list)) {
>   // wait_list is empty so branch not taken
> }
> spin_unlock(mutex.wait_lock)
> 
> CPU2 would have to test the mirror list, mutex_unlock(), and return
> before the spin_unlock() of CPU1. This is a tight race. I can add a
> synchronize_rcu() to device_unregister after the mutex_unlock() so that
> we also add a grace period before the device is potentially freed,
> which should make that race even more unlikely.
> 
> Moreover, for something really bad to happen, the freed memory would
> have to be reallocated right away by some other thread, which really
> sounds unlikely unless CPU1 is the slowest of all :)
> 
> Cheers,
> Jérôme
> 

But CPU1 could get preempted between the atomic_set and the 
spin_lock_mutex, and then it doesn't matter whether or not a grace period 
has elapsed before CPU2 proceeds.

Making race conditions less likely just makes them harder to pinpoint when 
they inevitably appear in the wild. I don't think it makes sense to spend 
any effort in making a race condition less likely, and that thread I 
referenced (https://lkml.org/lkml/2013/12/2/997) is fairly strong evidence 
that fixing this race actually matters. So, I think this race condition 
really needs to be fixed.

One fix is for hmm_mirror_unregister to wait for hmm_notifier_release 
completion between hmm_mirror_kill and hmm_mirror_unref. It can do this by 
calling synchronize_srcu() on the mmu_notifier's srcu. This has the 
benefit that the driver is guaranteed not to get the "mm is dead" callback 
after hmm_mirror_unregister returns.
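
A rough sketch of that approach (assuming, for illustration, that the
mmu_notifier SRCU is reachable from HMM; mmu_notifier_srcu below is a
placeholder, not an exported symbol):

static void hmm_mirror_unregister(struct hmm_mirror *mirror)
{
	/* Remove the mirror from the hmm list, if not already gone. */
	hmm_mirror_kill(mirror);

	/*
	 * Wait for any hmm_notifier_release() still running on another
	 * CPU before dropping the caller's reference, so no "mm is
	 * dead" callback can arrive once this function returns.
	 */
	synchronize_srcu(&mmu_notifier_srcu);

	hmm_mirror_unref(&mirror);
}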

In fact, are there any callbacks on the mirror that can arrive after 
hmm_mirror_unregister? If so, how will hmm_device_unregister solve them?

From a general standpoint, hmm_device_unregister must perform some kind of 
synchronization to be sure that all mirrors are completely released and 
done and no new callbacks will trigger. Since that has to be true, can't 
that synchronization be moved into hmm_mirror_unregister instead?

If that happens there's no need for a "mirror can be freed" ->release 
callback at all because the driver is guaranteed that a mirror is done 
after hmm_mirror_unregister.


* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-11 22:26                       ` Mark Hairgrove
@ 2015-06-15 14:32                         ` Jerome Glisse
  0 siblings, 0 replies; 22+ messages in thread
From: Jerome Glisse @ 2015-06-15 14:32 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai

On Thu, Jun 11, 2015 at 03:26:46PM -0700, Mark Hairgrove wrote:
> On Thu, 11 Jun 2015, Jerome Glisse wrote:
> > On Wed, Jun 10, 2015 at 06:15:08PM -0700, Mark Hairgrove wrote:

[...]
> > Ok, I see the race you are afraid of, and really it is an unlikely one.
> > __mutex_unlock_common_slowpath() takes a spinlock right after allowing
> > others to take the mutex. In your scenario there is no contention on
> > that spinlock, so it is taken right away, and as there is no one on the
> > mutex wait list it goes directly to unlocking the spinlock and
> > returning. You can ignore the debug functions: if debugging is enabled
> > then mutex_lock() would also need to take the spinlock, and thus you
> > would have proper synchronization between the two threads thanks to
> > mutex.wait_lock.
> > 
> > So basically, while CPU1 is doing:
> > 
> > spin_lock(mutex.wait_lock)
> > if (!list_empty(mutex.wait_list)) {
> >   // wait_list is empty so branch not taken
> > }
> > spin_unlock(mutex.wait_lock)
> > 
> > CPU2 would have to test the mirror list, mutex_unlock(), and return
> > before the spin_unlock() of CPU1. This is a tight race. I can add a
> > synchronize_rcu() to device_unregister after the mutex_unlock() so that
> > we also add a grace period before the device is potentially freed,
> > which should make that race even more unlikely.
> > 
> > Moreover, for something really bad to happen, the freed memory would
> > have to be reallocated right away by some other thread, which really
> > sounds unlikely unless CPU1 is the slowest of all :)
> > 
> > Cheers,
> > Jérôme
> > 
> 
> But CPU1 could get preempted between the atomic_set and the 
> spin_lock_mutex, and then it doesn't matter whether or not a grace period 
> has elapsed before CPU2 proceeds.
> 
> Making race conditions less likely just makes them harder to pinpoint when 
> they inevitably appear in the wild. I don't think it makes sense to spend 
> any effort in making a race condition less likely, and that thread I 
> referenced (https://lkml.org/lkml/2013/12/2/997) is fairly strong evidence 
> that fixing this race actually matters. So, I think this race condition 
> really needs to be fixed.
> 
> One fix is for hmm_mirror_unregister to wait for hmm_notifier_release 
> completion between hmm_mirror_kill and hmm_mirror_unref. It can do this by 
> calling synchronize_srcu() on the mmu_notifier's srcu. This has the 
> benefit that the driver is guaranteed not to get the "mm is dead" callback 
> after hmm_mirror_unregister returns.
> 
> In fact, are there any callbacks on the mirror that can arrive after 
> hmm_mirror_unregister? If so, how will hmm_device_unregister solve them?
> 
> From a general standpoint, hmm_device_unregister must perform some kind of 
> synchronization to be sure that all mirrors are completely released and 
> done and no new callbacks will trigger. Since that has to be true, can't 
> that synchronization be moved into hmm_mirror_unregister instead?
> 
> If that happens there's no need for a "mirror can be freed" ->release 
> callback at all because the driver is guaranteed that a mirror is done 
> after hmm_mirror_unregister.

Well, there is no need for 2 callbacks (release|stop, free), just one:
the release|stop one is what is needed. I kind of went halfway on this
last week. I will probably rework that a little to keep just one
callback and rely on the driver to call hmm_mirror_unregister().
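
Roughly, the callback set would then shrink to something like this
sketch (illustrative only, not the final interface):

struct hmm_device_ops {
	/*
	 * Called when the mirrored mm is going away. The driver must
	 * stop using the mirror and is expected to call
	 * hmm_mirror_unregister() itself; there is no separate
	 * "mirror can be freed" callback.
	 */
	void (*release)(struct hmm_mirror *mirror);
};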

Cheers,
Jérôme



* Re: [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM.
  2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
@ 2015-06-24 13:59     ` Haggai Eran
  0 siblings, 0 replies; 22+ messages in thread
From: Haggai Eran @ 2015-06-24 13:59 UTC (permalink / raw)
  To: jglisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Shachar Raindel, Liran Liss, Roland Dreier

On 21/05/2015 23:23, jglisse@redhat.com wrote:
> +int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
> +{
> +	struct mm_struct *mm = get_task_mm(current);
> +	struct ib_device *ib_device = context->device;
> +	struct ib_mirror *ib_mirror;
> +	struct pid *our_pid;
> +	int ret;
> +
> +	if (!mm || !ib_device->hmm_ready)
> +		return -EINVAL;
> +
> +	/* FIXME can this really happen ? */
No, following Yann Droneaud's patch 8abaae62f3fd ("IB/core: disallow
registering 0-sized memory region"), ib_umem_get() already checks
against zero-sized umems.

> +	if (unlikely(ib_umem_start(umem) == ib_umem_end(umem)))
> +		return -EINVAL;
> +
> +	/* Prevent creating ODP MRs in child processes */
> +	rcu_read_lock();
> +	our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
> +	rcu_read_unlock();
> +	put_pid(our_pid);
> +	if (context->tgid != our_pid) {
> +		mmput(mm);
> +		return -EINVAL;
> +	}
> +
> +	umem->hugetlb = 0;
> +	umem->odp_data = kmalloc(sizeof(*umem->odp_data), GFP_KERNEL);
> +	if (umem->odp_data == NULL) {
> +		mmput(mm);
> +		return -ENOMEM;
> +	}
> +	umem->odp_data->private = NULL;
> +	umem->odp_data->umem = umem;
> +
> +	mutex_lock(&ib_device->hmm_mutex);
> +	/* Is there an existing mirror for this process mm ? */
> +	ib_mirror = ib_mirror_ref(context->ib_mirror);
> +	if (!ib_mirror)
> +		list_for_each_entry(ib_mirror, &ib_device->ib_mirrors, list) {
> +			if (ib_mirror->base.hmm->mm != mm)
> +				continue;
> +			ib_mirror = ib_mirror_ref(ib_mirror);
> +			break;
> +		}
> +
> +	if (ib_mirror == NULL ||
> +	    ib_mirror == list_first_entry(&ib_device->ib_mirrors,
> +					  struct ib_mirror, list)) {
Is the second check an attempt to detect that the list_for_each_entry
above passed through all the entries and didn't break? Maybe I'm missing
something, but I think that case would leave ib_mirror holding a pointer
such that &ib_mirror->list == &ib_device->ib_mirrors (pointing to the
list head), not to the first entry.

In any case, I think it would be clearer if you added a separate
variable for iterating the ib_mirrors list.
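
For illustration, one way to restructure it with a separate iteration
variable (a sketch against the patch's own structures, not tested):

	struct ib_mirror *ib_mirror, *pos;

	mutex_lock(&ib_device->hmm_mutex);
	/* Is there an existing mirror for this process mm? */
	ib_mirror = ib_mirror_ref(context->ib_mirror);
	if (!ib_mirror) {
		list_for_each_entry(pos, &ib_device->ib_mirrors, list) {
			if (pos->base.hmm->mm == mm) {
				ib_mirror = ib_mirror_ref(pos);
				break;
			}
		}
	}
	if (!ib_mirror) {
		/* No existing mirror found: create a new one. */
	}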

> +		/* We need to create a new mirror. */
> +		ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL);
> +		if (ib_mirror == NULL) {
> +			mutex_unlock(&ib_device->hmm_mutex);
> +			mmput(mm);
> +			return -ENOMEM;
> +		}
> +		kref_init(&ib_mirror->kref);
> +		init_rwsem(&ib_mirror->umem_rwsem);
> +		ib_mirror->umem_tree = RB_ROOT;
> +		ib_mirror->ib_device = ib_device;
> +
> +		ib_mirror->base.device = &ib_device->hmm_dev;
> +		ret = hmm_mirror_register(&ib_mirror->base);
> +		if (ret) {
> +			mutex_unlock(&ib_device->hmm_mutex);
> +			kfree(ib_mirror);
> +			mmput(mm);
> +			return ret;
> +		}
> +
> +		list_add(&ib_mirror->list, &ib_device->ib_mirrors);
> +		context->ib_mirror = ib_mirror_ref(ib_mirror);
> +	}
> +	mutex_unlock(&ib_device->hmm_mutex);
> +	umem->odp_data->ib_mirror = ib_mirror;
> +
> +	down_write(&ib_mirror->umem_rwsem);
> +	rbt_ib_umem_insert(&umem->odp_data->interval_tree, &ib_mirror->umem_tree);
> +	up_write(&ib_mirror->umem_rwsem);
> +
> +	mmput(mm);
> +	return 0;
> +}



end of thread, other threads:[~2015-06-24 13:59 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1432236705-4209-1-git-send-email-j.glisse@gmail.com>
2015-05-21 19:31 ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse
2015-05-27  5:50   ` Aneesh Kumar K.V
2015-05-27 14:38     ` Jerome Glisse
2015-06-08 19:40   ` Mark Hairgrove
     [not found]     ` <alpine.DEB.2.00.1506081222270.27796-ptWJzH4JGIzJt4XymMeBgkn48jw8i0AO@public.gmane.org>
2015-06-08 21:17       ` Jerome Glisse
2015-06-09  1:54         ` Mark Hairgrove
     [not found]           ` <alpine.DEB.2.00.1506081841490.1802-ptWJzH4JGIzJt4XymMeBgkn48jw8i0AO@public.gmane.org>
2015-06-09 15:56             ` Jerome Glisse
2015-06-10  3:33               ` Mark Hairgrove
2015-06-10 15:42                 ` Jerome Glisse
2015-06-11  1:15                   ` Mark Hairgrove
2015-06-11 14:23                     ` Jerome Glisse
2015-06-11 22:26                       ` Mark Hairgrove
2015-06-15 14:32                         ` Jerome Glisse
     [not found] ` <1432239792-5002-1-git-send-email-jglisse@redhat.com>
2015-05-21 20:23   ` [PATCH 30/36] IB/mlx5: add a new paramter to mlx5_ib_update_mtt() for ODP with HMM jglisse
2015-05-21 20:23   ` [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range() jglisse
2015-05-21 20:23   ` [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP jglisse
2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
2015-06-24 13:59     ` Haggai Eran
2015-05-21 20:23   ` [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback jglisse
2015-05-21 20:23   ` [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM jglisse
2015-05-21 20:23   ` [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM jglisse
     [not found] <1432236233-4035-1-git-send-email-j.glisse@gmail.com>
     [not found] ` <1432236233-4035-1-git-send-email-j.glisse-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-05-21 19:23   ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse-Re5JQEeQqe8AvxtiuMwx3w
