All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Dan Williams <dan.j.williams@intel.com>,
	Christoph Hellwig <hch@lst.de>,
	Doug Ledford <dledford@redhat.com>,
	Hal Rosenstock <hal.rosenstock@gmail.com>,
	Inki Dae <inki.dae@samsung.com>, Jan Kara <jack@suse.cz>,
	Jason Gunthorpe <jgg@mellanox.com>,
	Jeff Moyer <jmoyer@redhat.com>,
	Joonyoung Shim <jy0922.shim@samsung.com>,
	Kyungmin Park <kyungmin.park@samsung.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Mel Gorman <mgorman@suse.de>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Sean Hefty <sean.hefty@intel.com>,
	Seung-Woo Kim <sw0312.kim@samsung.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.9 33/39] mm: introduce get_user_pages_longterm
Date: Mon, 26 Feb 2018 21:20:54 +0100	[thread overview]
Message-ID: <20180226201645.125678735@linuxfoundation.org> (raw)
In-Reply-To: <20180226201643.660109883@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Dan Williams <dan.j.williams@intel.com>

commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream.

Patch series "introduce get_user_pages_longterm()", v2.

Here is a new get_user_pages api for cases where a driver intends to
keep an elevated page count indefinitely.  This is distinct from usages
like iov_iter_get_pages where the elevated page counts are transient.
The iov_iter_get_pages cases immediately turn around and submit the
pages to a device driver which will put_page when the i/o operation
completes (under kernel control).

In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future.  This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime
of the block / page and needs reasonable limits on how long it can wait
for pages in a mapping to become idle.

Fixing filesystems to actually wait for dax pages to be idle before
blocks from a truncate/hole-punch operation are repurposed is saved for
a later patch series.

Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.

I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings
were supported by the kernel.  The behavior regression this policy
change implies is one of the reasons we maintain the "dax enabled.
Warning: EXPERIMENTAL, use at your own risk" notification when mounting
a filesystem in dax mode.

It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.

This patch (of 4):

Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow long standing memory registrations against
filesytem-dax vmas.  Device-dax vmas do not have this problem and are
explicitly allowed.

This is temporary until a "memory registration with layout-lease"
mechanism can be implemented for the affected sub-systems (RDMA and
V4L2).

[akpm@linux-foundation.org: use kcalloc()]
Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/linux/dax.h |    5 ----
 include/linux/fs.h  |   20 ++++++++++++++++
 include/linux/mm.h  |   13 ++++++++++
 mm/gup.c            |   64 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+), 5 deletions(-)

--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -61,11 +61,6 @@ static inline int dax_pmd_fault(struct v
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-static inline bool vma_is_dax(struct vm_area_struct *vma)
-{
-	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
-}
-
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -18,6 +18,7 @@
 #include <linux/bug.h>
 #include <linux/mutex.h>
 #include <linux/rwsem.h>
+#include <linux/mm_types.h>
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
@@ -3033,6 +3034,25 @@ static inline bool io_is_direct(struct f
 	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
 }
 
+static inline bool vma_is_dax(struct vm_area_struct *vma)
+{
+	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
+}
+
+static inline bool vma_is_fsdax(struct vm_area_struct *vma)
+{
+	struct inode *inode;
+
+	if (!vma->vm_file)
+		return false;
+	if (!vma_is_dax(vma))
+		return false;
+	inode = file_inode(vma->vm_file);
+	if (inode->i_mode == S_IFCHR)
+		return false; /* device-dax */
+	return true;
+}
+
 static inline int iocb_flags(struct file *file)
 {
 	int res = 0;
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1288,6 +1288,19 @@ long __get_user_pages_unlocked(struct ta
 			       struct page **pages, unsigned int gup_flags);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
+#ifdef CONFIG_FS_DAX
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+			    unsigned int gup_flags, struct page **pages,
+			    struct vm_area_struct **vmas);
+#else
+static inline long get_user_pages_longterm(unsigned long start,
+		unsigned long nr_pages, unsigned int gup_flags,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+#endif /* CONFIG_FS_DAX */
+
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -982,6 +982,70 @@ long get_user_pages(unsigned long start,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_FS_DAX
+/*
+ * This is the same as get_user_pages() in that it assumes we are
+ * operating on the current task's mm, but it goes further to validate
+ * that the vmas associated with the address range are suitable for
+ * longterm elevated page reference counts. For example, filesystem-dax
+ * mappings are subject to the lifetime enforced by the filesystem and
+ * we need guarantees that longterm users like RDMA and V4L2 only
+ * establish mappings that have a kernel enforced revocation mechanism.
+ *
+ * "longterm" == userspace controlled elevated page count lifetime.
+ * Contrast this to iov_iter_get_pages() usages which are transient.
+ */
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+		unsigned int gup_flags, struct page **pages,
+		struct vm_area_struct **vmas_arg)
+{
+	struct vm_area_struct **vmas = vmas_arg;
+	struct vm_area_struct *vma_prev = NULL;
+	long rc, i;
+
+	if (!pages)
+		return -EINVAL;
+
+	if (!vmas) {
+		vmas = kcalloc(nr_pages, sizeof(struct vm_area_struct *),
+			       GFP_KERNEL);
+		if (!vmas)
+			return -ENOMEM;
+	}
+
+	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+
+	for (i = 0; i < rc; i++) {
+		struct vm_area_struct *vma = vmas[i];
+
+		if (vma == vma_prev)
+			continue;
+
+		vma_prev = vma;
+
+		if (vma_is_fsdax(vma))
+			break;
+	}
+
+	/*
+	 * Either get_user_pages() failed, or the vma validation
+	 * succeeded, in either case we don't need to put_page() before
+	 * returning.
+	 */
+	if (i >= rc)
+		goto out;
+
+	for (i = 0; i < rc; i++)
+		put_page(pages[i]);
+	rc = -EOPNOTSUPP;
+out:
+	if (vmas != vmas_arg)
+		kfree(vmas);
+	return rc;
+}
+EXPORT_SYMBOL(get_user_pages_longterm);
+#endif /* CONFIG_FS_DAX */
+
 /**
  * populate_vma_page_range() -  populate a range of pages in the vma.
  * @vma:   target vma

  parent reply	other threads:[~2018-02-26 20:20 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-26 20:20 [PATCH 4.9 00/39] 4.9.85-stable review Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 01/39] netfilter: drop outermost socket lock in getsockopt() Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 02/39] xtensa: fix high memory/reserved memory collision Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 03/39] scsi: ibmvfc: fix misdefined reserved field in ibmvfc_fcp_rsp_info Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 04/39] cfg80211: fix cfg80211_beacon_dup Greg Kroah-Hartman
2018-02-26 20:20   ` Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 05/39] X.509: fix BUG_ON() when hash algorithm is unsupported Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 06/39] PKCS#7: fix certificate chain verification Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 07/39] RDMA/uverbs: Protect from command mask overflow Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 08/39] iio: buffer: check if a buffer has been set up when poll is called Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 09/39] iio: adis_lib: Initialize trigger before requesting interrupt Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 10/39] x86/oprofile: Fix bogus GCC-8 warning in nmi_setup() Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 11/39] irqchip/gic-v3: Use wmb() instead of smb_wmb() in gic_raise_softirq() Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 12/39] PCI/cxgb4: Extend T3 PCI quirk to T4+ devices Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 13/39] ohci-hcd: Fix race condition caused by ohci_urb_enqueue() and io_watchdog_func() Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 14/39] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks() Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 15/39] arm64: Disable unhandled signal log messages by default Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 16/39] Add delay-init quirk for Corsair K70 RGB keyboards Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 17/39] drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 18/39] usb: dwc3: gadget: Set maxpacket size for ep0 IN Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 19/39] usb: ldusb: add PIDs for new CASSY devices supported by this driver Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 20/39] Revert "usb: musb: host: dont start next rx urb if current one failed" Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 21/39] usb: gadget: f_fs: Process all descriptors during bind Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 22/39] usb: renesas_usbhs: missed the "running" flag in usb_dmac with rx path Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 23/39] drm/amdgpu: Add dpm quirk for Jet PRO (v2) Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 24/39] drm/amdgpu: add atpx quirk handling (v2) Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 25/39] drm/amdgpu: Avoid leaking PM domain on driver unbind (v2) Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 26/39] drm/amdgpu: add new device to use atpx quirk Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 27/39] binder: add missing binder_unlock() Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 28/39] X.509: fix NULL dereference when restricting key with unsupported_sig Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 29/39] mm: avoid spurious bad pmd warning messages Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 30/39] fs/dax.c: fix inefficiency in dax_writeback_mapping_range() Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 31/39] libnvdimm: fix integer overflow static analysis warning Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 32/39] device-dax: implement ->split() to catch invalid munmap attempts Greg Kroah-Hartman
2018-02-26 20:20 ` Greg Kroah-Hartman [this message]
2018-02-26 20:20 ` [PATCH 4.9 34/39] v4l2: disable filesystem-dax mapping support Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 35/39] IB/core: disable memory registration of filesystem-dax vmas Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 36/39] libnvdimm, dax: fix 1GB-aligned namespaces vs physical misalignment Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 37/39] mm: Fix devm_memremap_pages() collision handling Greg Kroah-Hartman
2018-02-26 20:20 ` [PATCH 4.9 38/39] mm: fail get_vaddr_frames() for filesystem-dax mappings Greg Kroah-Hartman
2018-02-26 20:21 ` [PATCH 4.9 39/39] x86/entry/64: Clear extra registers beyond syscall arguments, to reduce speculation attack surface Greg Kroah-Hartman
2018-02-27  0:57 ` [PATCH 4.9 00/39] 4.9.85-stable review Shuah Khan
2018-02-27  7:12 ` Naresh Kamboju
2018-02-27 14:56 ` Guenter Roeck
2018-02-27 18:34   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180226201645.125678735@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=dledford@redhat.com \
    --cc=hal.rosenstock@gmail.com \
    --cc=hch@lst.de \
    --cc=inki.dae@samsung.com \
    --cc=jack@suse.cz \
    --cc=jgg@mellanox.com \
    --cc=jmoyer@redhat.com \
    --cc=jy0922.shim@samsung.com \
    --cc=kyungmin.park@samsung.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    --cc=mgorman@suse.de \
    --cc=ross.zwisler@linux.intel.com \
    --cc=sean.hefty@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=sw0312.kim@samsung.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.