Linux userland API discussions
 help / color / mirror / Atom feed
* [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko
In-Reply-To: <1425575884-2574-1-git-send-email-aarcange@redhat.com>

This implements the uABI of UFFDIO_REMAP.

Notably one mode bitflag is also forwarded (and in turn known) by the
lowlevel remap_pages method.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 61251e6..db6e99a 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_REMAP)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -34,6 +35,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_REMAP			(0x05)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -50,6 +52,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_REMAP		_IOWR(UFFDIO, _UFFDIO_REMAP,	\
+				      struct uffdio_remap)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -122,4 +126,25 @@ struct uffdio_zeropage {
 	__s64 wake;
 };
 
+struct uffdio_remap {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * Especially if used to atomically remove memory from the
+	 * address space the wake on the dst range is not needed.
+	 */
+#define UFFDIO_REMAP_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES	((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * "remap" and "wake" are written by the ioctl and must be at
+	 * the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 remap;
+	__s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko
In-Reply-To: <1425575884-2574-1-git-send-email-aarcange@redhat.com>

remap_pages is the lowlevel mm helper needed to implement
UFFDIO_REMAP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h |  17 ++
 mm/huge_memory.c              | 120 ++++++++++
 mm/userfaultfd.c              | 526 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 663 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480a..3c39a4f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,6 +36,23 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len);
 
+/* remap_pages */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern ssize_t remap_pages(struct mm_struct *dst_mm,
+			   struct mm_struct *src_mm,
+			   unsigned long dst_start,
+			   unsigned long src_start,
+			   unsigned long len, __u64 flags);
+extern int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+				struct mm_struct *src_mm,
+				pmd_t *dst_pmd, pmd_t *src_pmd,
+				pmd_t dst_pmdval,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 					struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e25cb3..08c8afc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1531,6 +1531,124 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+/*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but it must return after releasing the
+ * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+			 struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd,
+			 pmd_t dst_pmdval,
+			 struct vm_area_struct *dst_vma,
+			 struct vm_area_struct *src_vma,
+			 unsigned long dst_addr,
+			 unsigned long src_addr)
+{
+	pmd_t _dst_pmd, src_pmdval;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	spinlock_t *src_ptl, *dst_ptl;
+	pgtable_t pgtable;
+
+	src_pmdval = *src_pmd;
+	src_ptl = pmd_lockptr(src_mm, src_pmd);
+
+	BUG_ON(!pmd_trans_huge(src_pmdval));
+	BUG_ON(pmd_trans_splitting(src_pmdval));
+	BUG_ON(!pmd_none(dst_pmdval));
+	BUG_ON(!spin_is_locked(src_ptl));
+	BUG_ON(!rwsem_is_locked(&src_mm->mmap_sem));
+	BUG_ON(!rwsem_is_locked(&dst_mm->mmap_sem));
+
+	src_page = pmd_page(src_pmdval);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(src_ptl);
+		return -EBUSY;
+	}
+
+	get_page(src_page);
+	spin_unlock(src_ptl);
+
+	mmu_notifier_invalidate_range_start(src_mm, src_addr,
+					    src_addr + HPAGE_PMD_SIZE);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(src_mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
+	double_pt_lock(src_ptl, dst_ptl);
+	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+		     !pmd_same(*dst_pmd, dst_pmdval) ||
+		     page_mapcount(src_page) != 1)) {
+		double_pt_unlock(src_ptl, dst_ptl);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(src_mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+	ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+	if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
+		      src_pmdval))
+		BUG();
+	_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+	_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+	set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
+	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (dst_mm != src_mm) {
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+	}
+	double_pt_unlock(src_ptl, dst_ptl);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(src_mm, src_addr,
+					  src_addr + HPAGE_PMD_SIZE);
+	return 0;
+}
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
  * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
@@ -2484,6 +2602,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later hanlded by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
+	 *
+	 * remap_pages is prevented to race as well thanks to the mmap_sem.
 	 */
 	down_write(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3f4c0ef..49521af 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -265,3 +265,529 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, true);
 }
+
+void double_pt_lock(spinlock_t *ptl1,
+		    spinlock_t *ptl2)
+	__acquires(ptl1)
+	__acquires(ptl2)
+{
+	spinlock_t *ptl_tmp;
+
+	if (ptl1 > ptl2) {
+		/* exchange ptl1 and ptl2 */
+		ptl_tmp = ptl1;
+		ptl1 = ptl2;
+		ptl2 = ptl_tmp;
+	}
+	/* lock in virtual address order to avoid lock inversion */
+	spin_lock(ptl1);
+	if (ptl1 != ptl2)
+		spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+	else
+		__acquire(ptl2);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+		      spinlock_t *ptl2)
+	__releases(ptl1)
+	__releases(ptl2)
+{
+	spin_unlock(ptl1);
+	if (ptl1 != ptl2)
+		spin_unlock(ptl2);
+	else
+		__release(ptl2);
+}
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from src_pmd to dst_pmd if possible, and return true if succeeded
+ * in moving the page.
+ */
+static int remap_pages_pte(struct mm_struct *dst_mm,
+			   struct mm_struct *src_mm,
+			   pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+			   struct vm_area_struct *dst_vma,
+			   struct vm_area_struct *src_vma,
+			   unsigned long dst_addr,
+			   unsigned long src_addr,
+			   spinlock_t *dst_ptl,
+			   spinlock_t *src_ptl,
+			   __u64 mode)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EEXIST;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte)) {
+		if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES))
+			return -ENOENT;
+		else
+			/* nothing to do to remap an hole */
+			return 0;
+	}
+
+	if (pte_present(orig_src_pte)) {
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EBUSY;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+						  dst_anon_vma);
+		ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+								 dst_addr);
+
+		if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+			      orig_src_pte))
+			BUG();
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(dst_mm, dst_addr, dst_pte, orig_dst_pte);
+
+		if (dst_mm != src_mm) {
+			inc_mm_counter(dst_mm, MM_ANONPAGES);
+			dec_mm_counter(src_mm, MM_ANONPAGES);
+		}
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+		mmu_notifier_invalidate_page(src_mm, src_addr);
+	} else {
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(src_mm, src_pmd,
+						     src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		if (swp_entry_swapcount(entry) != 1)
+			return -EBUSY;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_entry_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(src_mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(dst_mm, dst_addr, dst_pte, orig_src_pte);
+
+		if (dst_mm != src_mm) {
+			inc_mm_counter(dst_mm, MM_ANONPAGES);
+			dec_mm_counter(src_mm, MM_ANONPAGES);
+		}
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+/**
+ * remap_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * remap_pages() remaps arbitrary anonymous pages atomically in zero
+ * copy. It only works on non shared anonymous pages because those can
+ * be relocated without generating non linear anon_vmas in the rmap
+ * code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will have VM_USERFAULT set with
+ * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
+ * set with madvise(MADV_DONTFORK).
+ *
+ * The thread receiving the page during the userland page fault
+ * (MADV_USERFAULT) will receive the faulting page in the source vma
+ * through the network, storage or any other I/O device (MADV_DONTFORK
+ * in the source vma avoids remap_pages() to fail with -EBUSY if the
+ * process forks before remap_pages() is called), then it will call
+ * remap_pages() to map the page in the faulting address in the
+ * destination vma.
+ *
+ * This userfaultfd command works purely via pagetables, so it's the
+ * most efficient way to move physical non shared anonymous pages
+ * across different virtual addresses. Unlike mremap()/mmap()/munmap()
+ * it does not create any new vmas. The mapping in the destination
+ * address is atomic.
+ *
+ * It only works if the vma protection bits are identical from the
+ * source and destination vma.
+ *
+ * It can remap non shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_pages() will fail respectively with -ENOENT or -EEXIST. This
+ * provides a very strict behavior to avoid any chance of memory
+ * corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads try to both call remap_pages() on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this command.
+ *
+ * The command retval will return "len" is succesful. The command
+ * however can be interrupted by fatal signals or errors. If
+ * interrupted it will return the number of bytes successfully
+ * remapped before the interruption if any, or the negative error if
+ * none. It will never return zero. Either it will return an error or
+ * an amount of bytes successfully moved. If the retval reports a
+ * "short" remap, the remap_pages() command should be repeated by
+ * userland with src+retval, dst+reval, len-retval if it wants to know
+ * about the error that interrupted it.
+ *
+ * The UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES flag can be specified to
+ * prevent -ENOENT errors to materialize if there are holes in the
+ * source virtual range that is being remapped. The holes will be
+ * accounted as successfully remapped in the retval of the
+ * command. This is mostly useful to remap hugepage naturally aligned
+ * virtual regions without knowing if there are transparent hugepage
+ * in the regions or not, but preventing the risk of having to split
+ * the hugepmd during the remap.
+ *
+ * If there's any rmap walk that is taking the anon_vma locks without
+ * first obtaining the page lock (for example split_huge_page and
+ * page_referenced_anon), they will have to verify if the
+ * page->mapping has changed after taking the anon_vma lock. If it
+ * changed they should release the lock and retry obtaining a new
+ * anon_vma, because it means the anon_vma was changed by
+ * remap_pages() before the lock could be obtained. This is the only
+ * additional complexity added to the rmap code to provide this
+ * anonymous page remapping functionality.
+ */
+ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		    unsigned long dst_start, unsigned long src_start,
+		    unsigned long len, __u64 mode)
+{
+	struct vm_area_struct *src_vma, *dst_vma;
+	long err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+	ssize_t moved = 0;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(src_start & ~PAGE_MASK);
+	BUG_ON(dst_start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(src_start + len <= src_start);
+	BUG_ON(dst_start + len <= dst_start);
+
+	/*
+	 * Because these are read sempahores there's no risk of lock
+	 * inversion.
+	 */
+	down_read(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		down_read(&src_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(src_mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(dst_mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	/*
+	 * Be strict and only allow remap_pages if either the src or
+	 * dst range is registered in the userfaultfd to prevent
+	 * userland errors going unnoticed. As far as the VM
+	 * consistency is concerned, it would be perfectly safe to
+	 * remove this check, but there's no useful usage for
+	 * remap_pages ouside of userfaultfd registered ranges. This
+	 * is after all why it is an ioctl belonging to the
+	 * userfaultfd and not a syscall.
+	 *
+	 * Allow both vmas to be registered in the userfaultfd, just
+	 * in case somebody finds a way to make such a case useful.
+	 * Normally only one of the two vmas would be registered in
+	 * the userfaultfd.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx &&
+	    !src_vma->vm_userfaultfd_ctx.ctx)
+		goto out;
+
+	/*
+	 * FIXME: only allow remapping across anonymous vmas,
+	 * tmpfs should be added.
+	 */
+	if (src_vma->vm_ops || dst_vma->vm_ops)
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has a anon_vma or this page
+	 * would get a NULL anon_vma when moved in the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		spinlock_t *ptl;
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(src_mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				src_pmd = mm_alloc_pmd(src_mm, src_addr);
+				if (unlikely(!src_pmd)) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.  These
+			 * checks don't actually need the PT lock but
+			 * it's good to do it here to optimize this
+			 * block away at build time if
+			 * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(dst_pmdval) ||
+			    src_start + len - src_addr < HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				/* Fall through */
+				split_huge_page_pmd(src_vma, src_addr,
+						    src_pmd);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_pages_huge_pmd(dst_mm,
+							   src_mm,
+							   dst_pmd,
+							   src_pmd,
+							   dst_pmdval,
+							   dst_vma,
+							   src_vma,
+							   dst_addr,
+							   src_addr);
+				cond_resched();
+
+				if (!err) {
+					dst_addr += HPAGE_PMD_SIZE;
+					src_addr += HPAGE_PMD_SIZE;
+					moved += HPAGE_PMD_SIZE;
+				}
+
+				if ((!err || err == -EAGAIN) &&
+				    fatal_signal_pending(current))
+					err = -EINTR;
+
+				if (err && err != -EAGAIN)
+					break;
+
+				continue;
+			}
+		}
+
+		if (pmd_none(*src_pmd)) {
+			if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				if (unlikely(__pte_alloc(src_mm, src_vma,
+							 src_pmd, src_addr))) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * We held the mmap_sem for reading so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(dst_mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If an huge pmd materialized from under us fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(dst_mm, dst_pmd);
+		src_ptl = pte_lockptr(src_mm, src_pmd);
+
+		err = remap_pages_pte(dst_mm, src_mm,
+				      dst_pte, src_pte, src_pmd,
+				      dst_vma, src_vma,
+				      dst_addr, src_addr,
+				      dst_ptl, src_ptl, mode);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			moved += PAGE_SIZE;
+		}
+
+		if ((!err || err == -EAGAIN) &&
+		    fatal_signal_pending(current))
+			err = -EINTR;
+
+		if (err && err != -EAGAIN)
+			break;
+	}
+
+out:
+	up_read(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		up_read(&src_mm->mmap_sem);
+	BUG_ON(moved < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!moved && !err);
+	return moved ? moved : err;
+}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 20/21] userfaultfd: UFFDIO_REMAP
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko
In-Reply-To: <1425575884-2574-1-git-send-email-aarcange@redhat.com>

This remap ioctl allows to atomically move a page in or out of an
userfaultfd address space. It's more expensive than "copy" (and of
course more expensive than "zerofill") as it requires a TLB flush on
the source range for each ioctl, which is an expensive operation on
SMP. Especially if copying only a few pages at time, copying without
TLB flush is faster.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6230f22..b4c7f25 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -892,6 +892,54 @@ out:
 	return ret;
 }
 
+static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
+			     unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_remap uffdio_remap;
+	struct uffdio_remap __user *user_uffdio_remap;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_remap = (struct uffdio_remap __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_remap, user_uffdio_remap,
+			   /* don't copy "remap" and "wake" last field */
+			   sizeof(uffdio_remap)-sizeof(__s64)*2))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
+	if (ret)
+		goto out;
+	ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
+	if (ret)
+		goto out;
+	ret = -EINVAL;
+	if (uffdio_remap.mode & ~(UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES|
+				  UFFDIO_REMAP_MODE_DONTWAKE))
+		goto out;
+
+	ret = remap_pages(ctx->mm, current->mm,
+			  uffdio_remap.dst, uffdio_remap.src,
+			  uffdio_remap.len, uffdio_remap.mode);
+	if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_remap.mode & UFFDIO_REMAP_MODE_DONTWAKE)) {
+		range.start = uffdio_remap.dst;
+		ret = wake_userfault(ctx, &range);
+		if (unlikely(put_user(ret, &user_uffdio_remap->wake)))
+			return -EFAULT;
+	}
+	ret = range.len == uffdio_remap.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -955,6 +1003,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_ZEROPAGE:
 		ret = userfaultfd_zeropage(ctx, arg);
 		break;
+	case UFFDIO_REMAP:
+		ret = userfaultfd_remap(ctx, arg);
+		break;
 	}
 	return ret;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko
In-Reply-To: <1425575884-2574-1-git-send-email-aarcange@redhat.com>

These helpers will be used to know if to call handle_userfault() during
wrprotect faults in order to deliver the wrprotect faults to userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3c39a4f..81f0d11 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -65,6 +65,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -92,6 +97,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCHv2 0/2] N900 Modem Speech Support
From: Kai Vehmanen @ 2015-03-05 17:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Sebastian Reichel, Peter Ujfalusi, Kai Vehmanen, Pali Rohar,
	Aaro Koskinen, Ivaylo Dimitrov, linux-omap, linux-kernel,
	linux-api
In-Reply-To: <20150305113013.GA3867@amd>

Hi,

On Thu, 5 Mar 2015, Pavel Machek wrote:

>> Userland access goes via /dev/cmt_speech. The API is implemented in
>> libcmtspeechdata, which is used by ofono and the freesmartphone.org project.
> Yes, the ABI is "tested" for some years, but it is not documented, and
> it is very wrong ABI.
>
> I'm not sure what they do with the "read()". I was assuming it is
> meant for passing voice data, but it can return at most 4 bytes,
> AFAICT.
>
> We already have perfectly good ABI for passing voice data around. It
> is called "ALSA". libcmtspeech will then become unneccessary, and the
> daemon routing voice data will be as simple as "read sample from

I'm no longer involved with cmt_speech (with this driver nor modems in 
general), but let me clarify some bits about the design.

First, the team that designed the driver and the stack above had a lot of 
folks working also with ALSA (and the ALSA drivers have been merged to 
mainline long ago) and we considered ALSA on multiple occasions as the 
interface for this as well.

Our take was that ALSA is not the right interface for cmt_speech. The 
cmt_speech interface in the modem is _not_ a PCM interface as modelled by 
ALSA. Specifically:

- the interface is lossy in both directions
- data is sent in packets, not a stream of samples (could be other things
   than PCM samples), with timing and meta-data
- timing of uplink is of utmost importance

Some definite similarities:
  - the mmap interface to manage the PCM buffers (that is on purpose
    similar to that of ALSA)

The interface was designed so that the audio mixer (e.g. Pulseaudio) is 
run with a soft real-time SCHED_FIFO/RR user-space thread that has full 
control over _when_ voice _packets_ are sent, and can receive packets with 
meta-data (see libcmtspeechdata interface, cmtspeech.h), and can 
detect and handle gaps in the received packets.

This is very different from modems that offer an actual PCM voice link for 
example over I2S to the application processor (there are lots of these on 
the market). When you walk out of coverage during a call with these 
modems, you'll still get samples over I2S, but not so with cmt_speech, so 
ALSA is not the right interface.

Now, I'm not saying the interface is perfect, but just to give a bit of 
background, why a custom char-device interface was chosen.

PS Not saying it's enough for mainline inclusion, but libcmtspeechdata [1]
    was released and documented to enable the driver to be used by
    other software than the closed pulseaudio modules. You Pavel of course
    know this as you've been maintaining the library, but FYI for others.

[1] https://www.gitorious.org/libcmtspeechdata

Br, Kai

^ permalink raw reply

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
From: Linus Torvalds @ 2015-03-05 17:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Android Kernel Team, Kirill A. Shutemov,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner
In-Reply-To: <1425575884-2574-20-git-send-email-aarcange@redhat.com>

On Thu, Mar 5, 2015 at 9:18 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> remap_pages is the lowlevel mm helper needed to implement
> UFFDIO_REMAP.

This function is nasty nasty nasty.

Is this really worth it? On real loads? That people are expected to use?

Considering how we just got rid of one special magic VM remapping
thing that nobody actually used, I'd really hate to add a new one.

The fact is, almost nobody ever uses anything that isn't standard
POSIX. There are no apps, and even for specialized things like
virtualization hypervisors this kind of thing is often simply not
worth it.

Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
argue for not merging the remap parts. I just don't see the point. It
doesn't seem to add anything that is semantically very important -
it's *potentially* a faster copy, but even that is

  (a) questionable in the first place

and

 (b) unclear why anybody would ever care about performance of
infrastructure that nobody actually uses today, and future use isn't
even clear or shown to be particualrly performance-sensitive.

So basically I'd like to see better documentation, a few real use
cases (and by real I very much do *not* mean "you can use it for
this", but actual patches to actual projects that matter and that are
expected to care and merge them), and a simplified series that doesn't
do the remap thing.

Because *every* time we add a new clever interface, we end up with
approximately zero users and just pain down the line. Examples:
splice, mremap, yadda yadda.

                        Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
From: Pavel Emelyanov @ 2015-03-05 17:48 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro <kosaki.mo>
In-Reply-To: <1425575884-2574-6-git-send-email-aarcange@redhat.com>

> diff --git a/kernel/fork.c b/kernel/fork.c
> index cf65139..cb215c0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  			goto fail_nomem_anon_vma_fork;
>  		tmp->vm_flags &= ~VM_LOCKED;
>  		tmp->vm_next = tmp->vm_prev = NULL;
> +		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

This creates an interesting effect when the userfaultfd is used outside of
the process which created and activated one. If I try to monitor the memory
usage of one task with another, once the first task fork()-s, its child
begins to see zero-pages in the places where the monitor task was supposed
to insert pages with data.

>  		file = tmp->vm_file;
>  		if (file) {
>  			struct inode *inode = file_inode(file);
> .
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization
From: Pavel Emelyanov @ 2015-03-05 17:57 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro <kosaki.mo>
In-Reply-To: <1425575884-2574-11-git-send-email-aarcange@redhat.com>


> +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> +		     unsigned int flags, unsigned long reason)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct userfaultfd_ctx *ctx;
> +	struct userfaultfd_wait_queue uwq;
> +
> +	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +	ctx = vma->vm_userfaultfd_ctx.ctx;
> +	if (!ctx)
> +		return VM_FAULT_SIGBUS;
> +
> +	BUG_ON(ctx->mm != mm);
> +
> +	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> +	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +
> +	/*
> +	 * If it's already released don't get it. This avoids to loop
> +	 * in __get_user_pages if userfaultfd_release waits on the
> +	 * caller of handle_userfault to release the mmap_sem.
> +	 */
> +	if (unlikely(ACCESS_ONCE(ctx->released)))
> +		return VM_FAULT_SIGBUS;
> +
> +	/* check that we can return VM_FAULT_RETRY */
> +	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
> +		/*
> +		 * Validate the invariant that nowait must allow retry
> +		 * to be sure not to return SIGBUS erroneously on
> +		 * nowait invocations.
> +		 */
> +		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
> +#ifdef CONFIG_DEBUG_VM
> +		if (printk_ratelimit()) {
> +			printk(KERN_WARNING
> +			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
> +			dump_stack();
> +		}
> +#endif
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	/*
> +	 * Handle nowait, not much to do other than tell it to retry
> +	 * and wait.
> +	 */
> +	if (flags & FAULT_FLAG_RETRY_NOWAIT)
> +		return VM_FAULT_RETRY;
> +
> +	/* take the reference before dropping the mmap_sem */
> +	userfaultfd_ctx_get(ctx);
> +
> +	/* be gentle and immediately relinquish the mmap_sem */
> +	up_read(&mm->mmap_sem);
> +
> +	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> +	uwq.wq.private = current;
> +	uwq.address = userfault_address(address, flags, reason);

Since we report only the virtual address of the fault, this will make difficulties
for task monitoring the address space of some other task. Like this:

Let's assume a task creates a userfaultfd, activates one, registers several VMAs 
in it and then sends the ufd descriptor to other task. If later the first task will
remap those VMAs and will start touching pages, the monitor will start receiving 
fault addresses using which it will not be able to guess the exact vma the
requests come from.

Thanks,
Pavel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
From: Pavel Emelyanov @ 2015-03-05 18:01 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro <kosaki.mo>
In-Reply-To: <1425575884-2574-20-git-send-email-aarcange@redhat.com>

> +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		    unsigned long dst_start, unsigned long src_start,
> +		    unsigned long len, __u64 mode)
> +{
> +	struct vm_area_struct *src_vma, *dst_vma;
> +	long err = -EINVAL;
> +	pmd_t *src_pmd, *dst_pmd;
> +	pte_t *src_pte, *dst_pte;
> +	spinlock_t *dst_ptl, *src_ptl;
> +	unsigned long src_addr, dst_addr;
> +	int thp_aligned = -1;
> +	ssize_t moved = 0;
> +
> +	/*
> +	 * Sanitize the command parameters:
> +	 */
> +	BUG_ON(src_start & ~PAGE_MASK);
> +	BUG_ON(dst_start & ~PAGE_MASK);
> +	BUG_ON(len & ~PAGE_MASK);
> +
> +	/* Does the address range wrap, or is the span zero-sized? */
> +	BUG_ON(src_start + len <= src_start);
> +	BUG_ON(dst_start + len <= dst_start);
> +
> +	/*
> +	 * Because these are read sempahores there's no risk of lock
> +	 * inversion.
> +	 */
> +	down_read(&dst_mm->mmap_sem);
> +	if (dst_mm != src_mm)
> +		down_read(&src_mm->mmap_sem);
> +
> +	/*
> +	 * Make sure the vma is not shared, that the src and dst remap
> +	 * ranges are both valid and fully within a single existing
> +	 * vma.
> +	 */
> +	src_vma = find_vma(src_mm, src_start);
> +	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> +		goto out;
> +	if (src_start < src_vma->vm_start ||
> +	    src_start + len > src_vma->vm_end)
> +		goto out;
> +
> +	dst_vma = find_vma(dst_mm, dst_start);
> +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +		goto out;

I again have a concern about the case when one task monitors the VM of the
other one. If the target task (owning the mm) unmaps a VMA then the monitor
task (holding and operating on the ufd) will get plain EINVAL on UFFDIO_REMAP
request. This is not fatal, but still inconvenient as it will be hard to
find out the reason for failure -- dst VMA is removed and the monitor should
just drop the respective pages with data, or some other error has occurred
and some other actions should be taken.

Thanks,
Pavel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
From: Pavel Emelyanov @ 2015-03-05 18:07 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro <kosaki.mo>
In-Reply-To: <1425575884-2574-15-git-send-email-aarcange@redhat.com>

> +static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> +			    pmd_t *dst_pmd,
> +			    struct vm_area_struct *dst_vma,
> +			    unsigned long dst_addr,
> +			    unsigned long src_addr)
> +{
> +	struct mem_cgroup *memcg;
> +	pte_t _dst_pte, *dst_pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	void *page_kaddr;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
> +	if (!page)
> +		goto out;

Not a fatal thing, but still quite inconvenient. If there are two tasks that
have anonymous private VMAs that are still not COW-ed from each other, then
it will be impossible to keep the pages shared with userfault. Thus if we do
post-copy memory migration for tasks, then these guys will have their
memory COW-ed.


Thanks,
Pavel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 00/21] RFC: userfaultfd v3
From: Pavel Emelyanov @ 2015-03-05 18:15 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro <kosaki.mo>
In-Reply-To: <1425575884-2574-1-git-send-email-aarcange@redhat.com>


> All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
> live migration and the UFFD can be passed to a manager process through
> unix domain sockets to satisfy point 5).

Yup :) That's the best (from my POV) point of ufd -- the ability to delegate
the descriptor to some other task. Though there are several limitations (I've
expressed them in other e-mails), I'm definitely supporting this!

The respective CRIU code is quite sloppy yet, I will try to brush one up and
show soon.

Thanks,
Pavel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v1 4/6] eeprom: sunxi: Move the SID driver to the eeprom framework
From: Maxime Ripard @ 2015-03-05 18:36 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Srinivas Kandagatla,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rob Herring,
	Pawel Moll, Kumar Gala, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Stephen Boyd,
	andrew-g2DYL2Zd6BY, Arnd Bergmann, broonie-DgEjT+Ai2ygdnm+yROfE0A,
	Greg Kroah-Hartman
In-Reply-To: <1425550515.24292.212.camel@x220>

[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]

Hi Paul,

On Thu, Mar 05, 2015 at 11:15:15AM +0100, Paul Bolle wrote:
> >  endif
> > diff --git a/drivers/eeprom/Makefile b/drivers/eeprom/Makefile
> > index e130079..661422c 100644
> > --- a/drivers/eeprom/Makefile
> > +++ b/drivers/eeprom/Makefile
> > @@ -7,3 +7,4 @@ ccflags-$(CONFIG_EEPROM_DEBUG) += -DDEBUG
> >  obj-$(CONFIG_EEPROM)		+= core.o
> >  
> >  # Devices
> > +obj-$(CONFIG_EEPROM_SUNXI_SID)	+= eeprom-sunxi-sid.o
> > diff --git a/drivers/eeprom/eeprom-sunxi-sid.c b/drivers/eeprom/eeprom-sunxi-sid.c
> > new file mode 100644
> > index 0000000..eb32afb
> > --- /dev/null
> > +++ b/drivers/eeprom/eeprom-sunxi-sid.c
> > @@ -0,0 +1,129 @@
> > +/*
> > + * Allwinner sunXi SoCs Security ID support.
> > + *
> > + * Copyright (c) 2013 Oliver Schinagl <oliver-dxLnbx3+1qmEVqv0pETR8A@public.gmane.org>
> > + * Copyright (C) 2014 Maxime Ripard <maxime.ripard-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org>
> > + *
> > + * This file is licensed under the terms of the GNU General Public
> > + * License version 2.  This program is licensed "as is" without any
> > + * warranty of any kind, whether express or implied.
> > + */
> 
> So the license is GPL v2.
> 
> > +MODULE_LICENSE("GPL");
> 
> Which means you probably want
>     MODULE_LICENSE("GPL v2");
> 
> > --- a/drivers/misc/eeprom/sunxi_sid.c
> > +++ /dev/null
> > @@ -1,156 +0,0 @@
> > -/*
> > - * Copyright (c) 2013 Oliver Schinagl <oliver-dxLnbx3+1qmEVqv0pETR8A@public.gmane.org>
> > - * http://www.linux-sunxi.org
> > - *
> > - * This program is free software; you can redistribute it and/or modify
> > - * it under the terms of the GNU General Public License as published by
> > - * the Free Software Foundation; either version 2 of the License, or
> > - * (at your option) any later version.
> 
> But the previous driver was GPL v2 or later.
> 
> > -MODULE_LICENSE("GPL");
> 
> And this matches that.
> 
> Was it intended to re-license this, or is the code basically new? (I
> haven't compared the before and after code, to be honest.)

That was unintentional, especially since this driver is not new at
all, and is barely a conversion to that framework.

Thanks for catching this!
Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V2
From: Christoph Lameter @ 2015-03-05 18:41 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge Hallyn, Andy Lutomirski, Jonathan Corbet, Aaron Jones,
	linux-security-module, linux-kernel, akpm, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-api, Michael Kerrisk
In-Reply-To: <20150305171326.GA14998@mail.hallyn.com>

On Thu, 5 Mar 2015, Serge E. Hallyn wrote:

> > >
> > > So I'd say drop this change ^
> >
> > Then the ambient caps get ignored for a executables that have capabilities
> > seton the file?
>
> Yes.  Those are assumed to already know what they're doing.

Do they? What if there is a LD_PRELOAD library that redirects socket calls
and that needs raw device access (there are actually a number of software
packages like that to reduce the latency of network I/O. See for example
Solarflare's software products and the current rsocket libary in OFED.
There are cap issues if the rsocket library should be made useful for
Ethernet instead of infiniband).

> Why?  Do you foresee cases where a file that has fP set needs capabilities
> that aren't in its fP?

Yes due to the library issues.

> It seems more likely that they'll risk misbehaving due to an unexpected set
> of caps.

The userspace driver code in the library wont work since it does not have
the caps to access the raw device registers.

^ permalink raw reply

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
From: Andrea Arcangeli @ 2015-03-05 18:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Android Kernel Team, Kirill A. Shutemov,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David 
In-Reply-To: <CA+55aFzW=qaO0iKZWK9BWDNHu4eOgiKOJ-=0SvzsmZawuH5_3A@mail.gmail.com>

On Thu, Mar 05, 2015 at 09:39:48AM -0800, Linus Torvalds wrote:
> Is this really worth it? On real loads? That people are expected to use?

I fully agree that it's not worth merging upstream UFFDIO_REMAP until
(and if) a real world usage for it will showup. To further clarify:
would this not have been an RFC, the patchset would have stopped at
patch number 15/21 included.

Merging UFFDIO_REMAP with no real life users, would just increase the
attack vector surface of the kernel for no good.

Thanks for your idea that the UFFDIO_COPY is faster, the userland code
we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE, it never uses
UFFDIO_REMAP. I immediately agreed about UFFDIO_COPY being preferable
after you mentioned it during review of the previous RFC.

However this being a RFC with a large audience, and UFFDIO_REMAP
allowing to "remove" memory (think like externalizing memory into to
ceph with deduplication or such), I still added it just in case there
are real world use cases that may justify me keeping it around (even
if I would definitely not have submitted it for merging in the short
term regardless).

In addition of dropping the parts that aren't suitable for merging in
the short term like UFFDIO_REMAP, for any further submits that won't
substantially alter the API like it happened between the v2 to v3
RFCs, I'll also shrink the To/Cc list considerably.

> Considering how we just got rid of one special magic VM remapping
> thing that nobody actually used, I'd really hate to add a new one.

Having to define an API somehow, I tried to think at all possible
future usages and make sure the API would allow for those if needed.

> Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
> argue for not merging the remap parts. I just don't see the point. It
> doesn't seem to add anything that is semantically very important -
> it's *potentially* a faster copy, but even that is
> 
>   (a) questionable in the first place

Yes, we already measured the UFFDIO_COPY is faster than UFFDIO_REMAP,
the userfault latency decreases -20%.

> 
> and
> 
>  (b) unclear why anybody would ever care about performance of
> infrastructure that nobody actually uses today, and future use isn't
> even clear or shown to be particualrly performance-sensitive.

The only potential _theoretical_ case that justify the existence of
UFFDIO_REMAP is about "removing" memory from the address space. To
"add" memory UFFDIO_COPY and UFFDIO_ZEROPAGE are always preferable
like you suggested.

> So basically I'd like to see better documentation, a few real use
> cases (and by real I very much do *not* mean "you can use it for
> this", but actual patches to actual projects that matter and that are
> expected to care and merge them), and a simplified series that doesn't
> do the remap thing.

So far I wrote some doc in 2/21 and in the cover letter, but certainly
more docs are necessary. Trinity is also needed (I got trinity running
on the v2 API but I haven't adapted to the new API yet).

About the real world usages, this is the primary one:

http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

And it actually cannot be merged in qemu until userfaultfd is merged
in the kernel. There's simply no safe way to implement postcopy live
migration without something equivalent to the userfaultfd if all Linux
VM features are intended to be retained in the destination node.

> Because *every* time we add a new clever interface, we end up with
> approximately zero users and just pain down the line. Examples:
> splice, mremap, yadda yadda.

Aside from mremap which I think is widely used, I totally agree in
principle.

For now I can quite comfortably guarantee the above real life user for
userfaultfd (qemu), but there are potential 5 of them. And none needs
UFFDIO_REMAP, which is again why I totally agree of not submitting it
for merging and it was intended it only for the initial RFC to share
the idea of "removing" the memory with a larger audience before I
shrink the Cc/To list for further updates.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 2/9] selftests: Add install target
From: Dave Jones @ 2015-03-05 18:53 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	shuahkh-JPH+aEBZ4P+UEJcrhfAQsw, mmarek-AlSwsSmVLrQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1425358302-16680-2-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>

On Tue, Mar 03, 2015 at 03:51:35PM +1100, Michael Ellerman wrote:
 > This adds make install support to selftests. The basic usage is:
 > 
 > $ cd tools/testing/selftests
 > $ make install
 > 
 > That installs into tools/testing/selftests/install, which can then be
 > copied where ever necessary.
 > 
 > The install destination is also configurable using eg:
 > 
 > $ INSTALL_PATH=/mnt/selftests make install

 ...

 > +	@# Ask all targets to emit their test scripts
 > +	echo "#!/bin/bash\n\n" > $(ALL_SCRIPT)

$ ./all.sh 
-bash: ./all.sh: /bin/bash\n\n: bad interpreter: No such file or directory

Removing the \n\n fixes it.

 > +		echo "cd \$$ROOT\n" >> $(ALL_SCRIPT); \

ditto

	Dave

^ permalink raw reply

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
From: Linus Torvalds @ 2015-03-05 19:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel-qX2TKyscuCcdnm+yROfE0A, KVM list,
	Linux Kernel Mailing List, linux-mm, Linux API,
	Android Kernel Team, Kirill A. Shutemov, Pavel Emelyanov,
	Sanidhya Kashyap, zhang.zhanghailiang-hv44wF8Li93QT0dZR+AlfA,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner
In-Reply-To: <20150305185112.GL4280-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Thu, Mar 5, 2015 at 10:51 AM, Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> Thanks for your idea that the UFFDIO_COPY is faster, the userland code
> we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE, it never uses
> UFFDIO_REMAP.

Ok. So there's no actual expected use of the remap interface. Good.
That makes this series more palatable, since the rest didn't raise my
hackles much.

(But yeah, the documentation patch didn't really explain the uses very
much or at all, so I think something more is needed in that area).

                   Linus

^ permalink raw reply

* Re: [PATCH v1 3/6] eeprom: Add bindings for simple eeprom framework
From: Rob Herring @ 2015-03-05 20:11 UTC (permalink / raw)
  To: Srinivas Kandagatla
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	Maxime Ripard, Rob Herring, Pawel Moll, Kumar Gala,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Stephen Boyd,
	Andrew Lunn, Arnd Bergmann, Mark Brown, Greg Kroah-Hartman
In-Reply-To: <1425548765-13019-1-git-send-email-srinivas.kandagatla-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>

On Thu, Mar 5, 2015 at 3:46 AM, Srinivas Kandagatla
<srinivas.kandagatla-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
> This patch adds bindings for simple eeprom framework which allows eeprom
> consumers to talk to eeprom providers to get access to eeprom cell data.
>
> Signed-off-by: Maxime Ripard <maxime.ripard-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org>
> [Maxime Ripard: intial version of eeprom framework]
> Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
> ---
>  .../devicetree/bindings/eeprom/eeprom.txt          | 70 ++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/eeprom/eeprom.txt
>
> diff --git a/Documentation/devicetree/bindings/eeprom/eeprom.txt b/Documentation/devicetree/bindings/eeprom/eeprom.txt
> new file mode 100644
> index 0000000..dbfb95c
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/eeprom/eeprom.txt
> @@ -0,0 +1,70 @@
> += EEPROM Data Device Tree Bindings =
> +
> +This binding is intended to represent the location of hardware
> +configuration data stored in EEPROMs.
> +
> +On a significant proportion of boards, the manufacturer has stored
> +some data on an EEPROM-like device, for the OS to be able to retrieve
> +these information and act upon it. Obviously, the OS has to know
> +about where to retrieve these data from, and where they are stored on
> +the storage device.
> +
> +This document is here to document this.
> +
> += Data providers =
> +Contains bindings specific to provider drivers and data cells as children
> +to this node.
> +
> += Data cells =
> +These are the child nodes of the provider which contain data cell
> +information like offset and size in eeprom provider.
> +
> +Required properties:
> +reg:   specifies the offset in byte within that storage device, and the length
> +       in bytes of the data we care about.
> +       There could be more then one offset-length pairs in this property.
> +
> +Optional properties:
> +As required by specific data parsers/interpreters.
> +
> +For example:
> +
> +       /* Provider */
> +       qfprom: qfprom@00700000 {
> +               compatible      = "qcom,qfprom";
> +               reg             = <0x00700000 0x1000>;
> +               ...
> +
> +               /* Data cells */
> +               tsens_calibration: calib@404 {
> +                       reg = <0x404 0x10>;
> +               };
> +
> +               serial_number: sn {
> +                       reg = <0x104 0x4>, <0x204 0x4>, <0x30c 0x4>;
> +
> +               };
> +               ...
> +       };
> +
> += Data consumers =
> +Are drivers which consume eeprom data cells.

s/drivers/device nodes/

> +
> +Required properties:
> +
> +eeproms: List of phandle and data cell the device might be interested in.
> +
> +Optional properties:
> +
> +eeprom-names: List of data cell name strings sorted in the same order
> +             as the resets property. Consumers drivers will use

resets?

> +             eeprom-names to differentiate between multiple cells,
> +             and hence being able to know what these cells are for.

Is this still needed? The sub-node name defines the name. Or you can
use reg-names with-in the sub-node.

Rob

> +
> +For example:
> +
> +       tsens {
> +               ...
> +               eeproms = <&tsens_calibration>;
> +               eeprom-names = "calibration";
> +       };
> --
> 1.9.1
>

^ permalink raw reply

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
From: Jason Baron @ 2015-03-05 20:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, normalperson-rMlxZR9MS24,
	davidel-AhlLAIvw+VEjIGhXcJzhZg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds, Alexander Viro
In-Reply-To: <20150305091517.GA25158-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>


On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more 
>> permanently to the thread. That is, in the case of a listen socket 
>> we then add the connected socket to the woken up threads local set 
>> of epoll events. So the load persists past the wake up. And in this 
>> case, doing the round robin wakeups, simply allows us to access more 
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity 
>> to do the wakeups as well as you suggested.)
> So this is the part that I still don't understand.
>
> What difference does LIFO versus FIFO wakeups make to CPU utilization: 
> a thread waiting for work is idle, no matter whether it ran most 
> recently or least recently.
>
> Once an idle worker thread is woken it will compute its own work, for 
> whatever time it needs to, and won't be bothered by epoll again until 
> it finished its work and starts waiting again.
>
> So regardless the wakeup order it's the same principal bandwidth 
> utilization, modulo caching artifacts [*] and modulo scheduling 
> artifacts [**]:

So just adding the wakeup source as 'exclusive', I think would
give much of the desired behavior as you point out. In the first
patch posting I separated 'exclusive' from 'rotate' (where rotate
depended on exclusive), since the idle threads will tend to get
assigned the new work vs. the busy threads as you point out
and the workload naturally spreads out (modulo the artifacts
you mentioned).

However, I added the 'rotate' b/c I'm assigning work via the
wakeup that persists past the wakeup point. So without the rotate
I might end up assigning a lot of work to always say the first
thread if its always idle. And then I might get a large burst of
work queued to it at some later point. The rotate is intended
to address this case.

To use some pseudo-code in hopes of clarifying things, each
thread is roughly doing:

epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd...);

while(1) {

    epoll_wait(epfd...);
    fd = accept(listen_fd...);
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd...);
    ...do any additional desired fd processing...
}

So since the work persists past the wakeup point (after
the 'fd' has been assigned to the epfd set of the local
thread), I am trying to balance out future load.

This is an issue that current userspace has to address in
various ways. In our case, we periodically remove all
epfds from the listen socket, and then re-add in a
different order periodically. Another alternative that was
suggested by Eric was to have a dedicated thread(s), to
do the assignment. So these approaches can work to an
extent, but they seem at least to me to complicate
userspace somewhat. And at least in our case, its not
providing as good balancing as this approach.

So I am trying to use epoll in a special way to do work
assignment. I think the model is different from the
standard waker/wakee model. So to that end, in this
v3 version, I've attempted to isolate all the changes to
be contained within epoll to reflect that fact.

Thanks,

-Jason


>
> [*]  Caching artifacts: in that sense Andrew's point stands: given 
>      multiple equivalent choices it's more beneficial to pick a thread 
>      that was most recently used (and is thus most cache-hot - i.e. 
>      the current wakeup behavior), versus a thread that was least 
>      recently used (and is thus the most cache-cold - i.e. the 
>      round-robin wakeup you introduce).
>
> [**] The hack patch I posted in my previous reply.
>
> Thanks,
>
> 	Ingo
>

^ permalink raw reply

* Re: [PATCH v1 3/6] eeprom: Add bindings for simple eeprom framework
From: Srinivas Kandagatla @ 2015-03-05 22:34 UTC (permalink / raw)
  To: Rob Herring
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	Maxime Ripard, Rob Herring, Pawel Moll, Kumar Gala,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Stephen Boyd,
	Andrew Lunn, Arnd Bergmann, Mark Brown, Greg Kroah-Hartman
In-Reply-To: <CAL_Jsq+wrZYn82otDfLsTVpgmhLFWtzRHQ4v+qn-Ks--ZpXR8Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

>> +
>> +For example:
>> +
>> +       /* Provider */
>> +       qfprom: qfprom@00700000 {
>> +               compatible      = "qcom,qfprom";
>> +               reg             = <0x00700000 0x1000>;
>> +               ...
>> +
>> +               /* Data cells */
>> +               tsens_calibration: calib@404 {
>> +                       reg = <0x404 0x10>;
>> +               };
>> +
>> +               serial_number: sn {
>> +                       reg = <0x104 0x4>, <0x204 0x4>, <0x30c 0x4>;
>> +
>> +               };
>> +               ...
>> +       };
>> +
>> += Data consumers =
>> +Are drivers which consume eeprom data cells.
>
> s/drivers/device nodes/
>
Thats true, "device nodes" makes sense.
>> +
>> +Required properties:
>> +
>> +eeproms: List of phandle and data cell the device might be interested in.
>> +
>> +Optional properties:
>> +
>> +eeprom-names: List of data cell name strings sorted in the same order
>> +             as the resets property. Consumers drivers will use
>
> resets?
Opps..
I remember fixing this, I will take care of it in next version.
>
>> +             eeprom-names to differentiate between multiple cells,
>> +             and hence being able to know what these cells are for.
>
> Is this still needed? The sub-node name defines the name. Or you can
> use reg-names with-in the sub-node.
Yes, eeprom-names is needed in the consumer nodes, where there are 
multiple eeproms cells, its easy to lookup by name rather than 
index,which depends on the order of the entries.

reg-names inside the "data cells" is ok, but I can't think of its use 
immediately. May be useful for debug?

--srini
 >
>
> Rob
>
>> +
>> +For example:
>> +
>> +       tsens {
>> +               ...
>> +               eeproms = <&tsens_calibration>;
>> +               eeprom-names = "calibration";
>> +       };
>> --
>> 1.9.1
>>
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V2
From: Andy Lutomirski @ 2015-03-05 23:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jarkko Sakkinen, Andrew Morton, LSM List, Andrew G. Morgan,
	Michael Kerrisk, Mimi Zohar, linux-kernel@vger.kernel.org,
	Austin S Hemmelgarn, Aaron Jones, Serge Hallyn, Serge E. Hallyn,
	Markku Savela, Linux API, Jonathan Corbet
In-Reply-To: <alpine.DEB.2.11.1503051235300.32373@gentwo.org>

On Mar 5, 2015 10:41 AM, "Christoph Lameter" <cl@linux.com> wrote:
>
> On Thu, 5 Mar 2015, Serge E. Hallyn wrote:
>
> > > >
> > > > So I'd say drop this change ^
> > >
> > > Then the ambient caps get ignored for a executables that have capabilities
> > > seton the file?
> >
> > Yes.  Those are assumed to already know what they're doing.
>
> Do they? What if there is a LD_PRELOAD library that redirects socket calls
> and that needs raw device access (there are actually a number of software
> packages like that to reduce the latency of network I/O. See for example
> Solarflare's software products and the current rsocket libary in OFED.
> There are cap issues if the rsocket library should be made useful for
> Ethernet instead of infiniband).
>
> > Why?  Do you foresee cases where a file that has fP set needs capabilities
> > that aren't in its fP?
>
> Yes due to the library issues.

You can't LD_PRELOAD and fP together.  And I'm still unconvinced that
ambient caps can ever be safe in conjunction with fP.  I'll grill you
next week on what you're trying to do that makes you want this :)

--Andy

>
> > It seems more likely that they'll risk misbehaving due to an unexpected set
> > of caps.
>
> The userspace driver code in the library wont work since it does not have
> the caps to access the raw device registers.

^ permalink raw reply

* Re: [PATCH 07/14] clockevent: Add STM32 Timer driver
From: Linus Walleij @ 2015-03-06  8:57 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Jonathan Corbet, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, Philipp Zabel, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Arnd Bergmann, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo,
	Will Deacon, Nikolay Borisov, Rusty Russell, Kees Cook, Michal
In-Reply-To: <1423763164-5606-8-git-send-email-mcoquelin.stm32@gmail.com>

On Thu, Feb 12, 2015 at 6:45 PM, Maxime Coquelin
<mcoquelin.stm32@gmail.com> wrote:

> STM32 MCUs feature 16 and 32 bits general purpose timers with prescalers.
> The drivers detects whether the time is 16 or 32 bits, and applies a
> 1024 prescaler value if it is 16 bits.
>
> Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>

This is a nice driver.
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH 14/14] MAINTAINERS: Add entry for STM32 MCUs
From: Linus Walleij @ 2015-03-06  9:03 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Jonathan Corbet, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, Philipp Zabel, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Arnd Bergmann, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo,
	Will Deacon, Nikolay Borisov, Rusty Russell, Kees Cook, Michal
In-Reply-To: <1423763164-5606-15-git-send-email-mcoquelin.stm32-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Thu, Feb 12, 2015 at 6:46 PM, Maxime Coquelin
<mcoquelin.stm32-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> Add a MAINTAINER entry covering all STM32 machine and drivers files.
>
> Signed-off-by: Maxime Coquelin <mcoquelin.stm32-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

(...)
> +F:     drivers/clocksource/arm_system_timer.c

Is that all? And that is not even a STM32 specific driver.

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH v2 10/18] dt-bindings: Document the STM32 pin controller
From: Linus Walleij @ 2015-03-06  9:12 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Uwe Kleine-König, Andreas Färber, Geert Uytterhoeven,
	Rob Herring, Philipp Zabel, Jonathan Corbet, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Arnd Bergmann, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo
In-Reply-To: <1424455277-29983-11-git-send-email-mcoquelin.stm32-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Fri, Feb 20, 2015 at 7:01 PM, Maxime Coquelin
<mcoquelin.stm32-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> This adds documentation of device tree bindings for the
> STM32 pin controller.
>
> Signed-off-by: Maxime Coquelin <mcoquelin.stm32-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
(...)
> +Pin controller node:
> +Required properies:
> +- compatible   : "st,stm32-pinctrl"
> +- #address-cells: The value of this property must be 1
> +- #size-cells  : The value of this property must be 1
> +- ranges       : defines mapping between pin controller node (parent) to
> +  gpio-bank node (children).
> +
> +GPIO controller/bank node:
> +Required properties:
> +- gpio-controller : Indicates this device is a GPIO controller
> +- #gpio-cells    : Should be two.
> +                       The first cell is the pin number
> +                       The second one is the polarity:
> +                               - 0 for active high
> +                               - 1 for active low
> +- reg            : The gpio address range, relative to the pinctrl range
> +- st,bank-name   : Should be a name string for this bank as specified in
> +  the datasheet

OK...

(...)
> +               ...
> +               pin-functions nodes follow...
> +       };
> +
> +Contents of function subnode node:
> +----------------------------------
> +
> +Required properties for pin configuration node:
> +- st,pins      : Child node with list of pins with configuration.

Don't use vendor prefix. Use the new standard notation, just "pins".
See Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt

> +Below is the format of how each pin conf should look like.
> +
> +<bank offset altmode pull type speed>
> +
> +Every PIO is represented with 4 to 6 parameters.
> +Each parameter is explained as below.
> +
> +- bank   : Should be bank phandle to which this PIO belongs.
> +- offset  : Offset in the PIO bank.

No. Don't hardcode register offsets into the DTS files. This is
something the driver should know from the compatible string or
instance number, not by explicit reference.

> +- altmode : Should be mode or alternate function number associated this pin, as
> +described in the datasheet (IN, OUT, ALT0...ALT15, ANALOG)
> +- pull   : Should be either NO_PULL, PULL_UP or PULL_DOWN
> +- type   : Should be either PUSH_PULL or OPEN_DRAIN.
> +           Setting it is not needed for IN and ANALOG modes, or alternate
> +           functions acting as inputs.
> +- speed          : Value taken from the datasheet, depending on the function
> +(LOW_SPEED, MEDIUM_SPEED, FAST_SPEED, HIGH_SPEED)
> +           Setting it is not needed for IN and ANALOG modes, or alternate
> +           functions acting as inputs.

NACK. Make your Kconfig select GENERIC_PINCONF and use the
already existing generic pin configuration specifiers from the standard
bindings. Like bias-pull-up; etc.

This kind of custom pin config is a thing of the past.

> +usart1 {
> +       pinctrl_usart1: usart1-0 {
> +               st,pins {
> +                       tx = <&gpioa 9 ALT7 NO_PULL PUSH_PULL LOW_SPEED>;
> +                       rx = <&gpioa 10 ALT7 NO_PULL PUSH_PULL LOW_SPEED>;
> +               };
> +       };
> +};

You might have to use separate parameters for muxing and config
in your description.

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH v2 11/18] pinctrl: Add pinctrl driver for STM32 MCUs
From: Linus Walleij @ 2015-03-06  9:24 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Uwe Kleine-König, Andreas Färber, Geert Uytterhoeven,
	Rob Herring, Philipp Zabel, Jonathan Corbet, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Arnd Bergmann, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo
In-Reply-To: <1424455277-29983-12-git-send-email-mcoquelin.stm32@gmail.com>

On Fri, Feb 20, 2015 at 7:01 PM, Maxime Coquelin
<mcoquelin.stm32@gmail.com> wrote:

> This driver adds pinctrl and GPIO support to STMicrolectronic's
> STM32 family of MCUs.
>
> Pin muxing and GPIO handling have been tested on STM32F429
> based Discovery board.
>
> Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>

(...)
> +config PINCTRL_STM32
> +       bool "STMicroelectronics STM32 pinctrl driver"
> +       depends on OF
> +       depends on ARCH_STM32 || COMPILE_TEST
> +       select PINMUX
> +       select PINCONF
> +       select GPIOLIB_IRQCHIP
> +       help
> +         This selects the device tree based generic pinctrl driver for STM32.

Good start! Especially that you use GPIOLIB_IRQCHIP.

But this (as discussed earlier) should select GENERIC_PINCONF

Stopping review here so you can reengineer it a bit using GENERIC_PINCONF
for next submission.

Also think about pinmux in single registers, whether you want to do this
with a single value for a register or using strings to identify groups
and functions.

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH v2 10/18] dt-bindings: Document the STM32 pin controller
From: Linus Walleij @ 2015-03-06  9:35 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: Uwe Kleine-König, Andreas Färber, Geert Uytterhoeven,
	Rob Herring, Philipp Zabel, Jonathan Corbet, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Arnd Bergmann, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo
In-Reply-To: <1424455277-29983-11-git-send-email-mcoquelin.stm32@gmail.com>

I saw this other thing:

On Fri, Feb 20, 2015 at 7:01 PM, Maxime Coquelin
<mcoquelin.stm32@gmail.com> wrote:

> This adds documentation of device tree bindings for the
> STM32 pin controller.
>
> Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
(...)
> +- altmode : Should be mode or alternate function number associated this pin, as
> +described in the datasheet (IN, OUT, ALT0...ALT15, ANALOG)

We can now describe muxing (altmodes etc) in two ways as described
in the generic bindings in
Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt

This is done by strings combining a function with N groups.

We are also discussing having a single config number setting up
all and keeping down the size of the DTB (which
is close to what you're doing here). Please take part in that
discussion to standardize such bindings. Sascha Hauer and
others are involved, don't know the exact topic right now but
it involved using a single "pinmux" parameter in the device treel.

All agree on using the standardized pin config bindings henceforth
so start by migrating to these.

Yours,
Linus Walleij

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox