* [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
       [not found] <20170425012230.GX29622@ZenIV.linux.org.uk>
@ 2017-04-26 21:56 ` Dan Williams
  2017-04-27  6:30   ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-04-26 21:56 UTC (permalink / raw)
  To: viro
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, hch, linux-block,
	linux-nvdimm, Jeff Moyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, Ross Zwisler

The pmem driver needs to transfer data to a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver, which will turn around and fence previous
writes with an "sfence".

Implement __copy_from_user_inatomic_writethru(), memcpy_page_writethru(),
and memcpy_writethru(), which guarantee that the destination buffer is
not dirty in the cpu cache on completion. The new
copy_from_iter_writethru() and sub-routines will be used to replace the
"pmem api" (include/linux/pmem.h + arch/x86/include/asm/pmem.h). The
availability of copy_from_iter_writethru() and memcpy_writethru() is
gated by the CONFIG_ARCH_HAS_UACCESS_WRITETHRU config symbol; they fall
back to copy_from_iter_nocache() and plain memcpy() otherwise.
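
For example, a hypothetical caller (not part of this patch) that needs
the flush guarantee would pair the copy with a config check, mirroring
the persistence warning the pmem driver emits below:

static size_t example_copy_to_pmem(void *dst, size_t bytes,
		struct iov_iter *i)
{
	/* flushes the destination cachelines only when the arch opts in */
	size_t n = copy_from_iter_writethru(dst, bytes, i);

	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHRU))
		pr_warn_once("writes may still be dirty in the cpu cache\n");
	return n;
}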

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

This patch is based on a merge of vfs.git/for-next and
nvdimm.git/libnvdimm-for-next.

 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/string_64.h  |    5 +
 arch/x86/include/asm/uaccess_64.h |   13 ++++
 arch/x86/lib/usercopy_64.c        |  128 +++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit/core.c          |    2 -
 drivers/nvdimm/claim.c            |    2 -
 drivers/nvdimm/pmem.c             |   13 +++-
 drivers/nvdimm/region_devs.c      |    2 -
 include/linux/dax.h               |    3 +
 include/linux/string.h            |    6 ++
 include/linux/uio.h               |   15 ++++
 lib/Kconfig                       |    3 +
 lib/iov_iter.c                    |   22 ++++++
 13 files changed, 210 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..bd3ff407d707 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_PMEM_API		if X86_64
+	select ARCH_HAS_UACCESS_WRITETHRU	if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..60173bc51603 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
+void memcpy_writethru(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..748e8a50e4b3 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,11 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_writethru(void *dst, const void __user *src,
+				  unsigned size);
+extern void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+				  size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
@@ -179,6 +184,14 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_writethru(void *dst, const void __user *src,
+				  unsigned size)
+{
+	kasan_check_write(dst, size);
+	return __copy_user_writethru(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..144cb5e59193 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
  */
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/highmem.h>
 
 /*
  * Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 	clac();
 	return len;
 }
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @addr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vend = addr + size;
+	void *p;
+
+	for (p = (void *)((unsigned long)addr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+}
+
+long __copy_user_writethru(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	long rc = __copy_user_nocache(dst, src, size, 0);
+
+	/*
+	 * __copy_user_nocache() uses non-temporal stores for the bulk
+	 * of the transfer, but we need to manually flush if the
+	 * transfer is unaligned. A cached memory copy is used when
+	 * destination or size is not naturally aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 */
+	if (size < 8) {
+		if (!IS_ALIGNED(dest, 4) || size != 4)
+			clean_cache_range(dst, 1);
+	} else {
+		if (!IS_ALIGNED(dest, 8)) {
+			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+			clean_cache_range(dst, 1);
+		}
+
+		flushed = dest - (unsigned long) dst;
+		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+			clean_cache_range(dst + size - 1, 1);
+	}
+
+	return rc;
+}
+
+void memcpy_writethru(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/* cache copy and flush to align dest */
+	if (!IS_ALIGNED(dest, 8)) {
+		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+		memcpy((void *) dest, (void *) source, len);
+		clean_cache_range((void *) dest, len);
+		dest += len;
+		source += len;
+		size -= len;
+		if (!size)
+			return;
+	}
+
+	/* 4x8 movnti loop */
+	while (size >= 32) {
+		asm("movq    (%0), %%r8\n"
+		    "movq   8(%0), %%r9\n"
+		    "movq  16(%0), %%r10\n"
+		    "movq  24(%0), %%r11\n"
+		    "movnti  %%r8,   (%1)\n"
+		    "movnti  %%r9,  8(%1)\n"
+		    "movnti %%r10, 16(%1)\n"
+		    "movnti %%r11, 24(%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8", "r9", "r10", "r11");
+		dest += 32;
+		source += 32;
+		size -= 32;
+	}
+
+	/* 1x8 movnti loop */
+	while (size >= 8) {
+		asm("movq    (%0), %%r8\n"
+		    "movnti  %%r8,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 8;
+		source += 8;
+		size -= 8;
+	}
+
+	/* 1x4 movnti loop */
+	while (size >= 4) {
+		asm("movl    (%0), %%r8d\n"
+		    "movnti  %%r8d,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 4;
+		source += 4;
+		size -= 4;
+	}
+
+	/* cache copy for remaining bytes */
+	if (size) {
+		memcpy((void *) dest, (void *) source, size);
+		clean_cache_range((void *) dest, size);
+	}
+}
+EXPORT_SYMBOL_GPL(memcpy_writethru);
+
+void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+		size_t len)
+{
+	char *from = kmap_atomic(page);
+
+	memcpy_writethru(to, from + offset, len);
+	kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..c84e242f91ed 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,7 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		}
 
 		if (rw)
-			memcpy_to_pmem(mmio->addr.aperture + offset,
+			memcpy_writethru(mmio->addr.aperture + offset,
 					iobuf + copied, c);
 		else {
 			if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..38822f6fa49f 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 			rc = -EIO;
 	}
 
-	memcpy_to_pmem(nsio->addr + offset, buf, size);
+	memcpy_writethru(nsio->addr + offset, buf, size);
 	nvdimm_flush(to_nd_region(ndns->dev.parent));
 
 	return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..28dc82a595a5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
+#include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
 {
 	void *mem = kmap_atomic(page);
 
-	memcpy_to_pmem(pmem_addr, mem + off, len);
+	memcpy_writethru(pmem_addr, mem + off, len);
 	kunmap_atomic(mem);
 }
 
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter_writethru(addr, bytes, i);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHRU)
+			|| nvdimm_has_flush(nd_region) < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..b668ba455c39 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,7 +947,7 @@ void nvdimm_flush(struct nd_region *nd_region)
 	 * The first wmb() is needed to 'sfence' all previous writes
 	 * such that they are architecturally visible for the platform
 	 * buffer flush.  Note that we've already arranged for pmem
-	 * writes to avoid the cache via arch_memcpy_to_pmem().  The
+	 * writes to avoid the cache via memcpy_writethru().  The
 	 * final wmb() ensures ordering for the NVDIMM flush write.
 	 */
 	wmb();
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };
 
 int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..f4e166d88e2a 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
 	return 0;
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_WRITETHRU
+static inline void memcpy_writethru(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..d284cb5e89fa 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
 bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/*
+ * Note, users like pmem that depend on copy_from_iter_writethru()
+ * having stricter semantics than copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHRU) before assuming
+ * that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_writethru(void *addr, size_t bytes,
+		struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..db31bc186df2 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
 config ARCH_HAS_PMEM_API
 	bool
 
+config ARCH_HAS_UACCESS_WRITETHRU
+	bool
+
 config ARCH_HAS_MMIO_FLUSH
 	bool
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..afc3dc75346c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,28 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter_nocache);
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_inatomic_writethru((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_page_writethru((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy_writethru((to += v.iov_len) - v.iov_len, v.iov_base,
+			v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_writethru);
+#endif
+
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;


* Re: [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
  2017-04-26 21:56 ` [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem Dan Williams
@ 2017-04-27  6:30   ` Ingo Molnar
  2017-04-28 19:39     ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2017-04-27  6:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, Jeff Moyer, Ingo Molnar,
	H. Peter Anvin, linux-fsdevel, Thomas Gleixner, Ross Zwisler


* Dan Williams <dan.j.williams@intel.com> wrote:

> +#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
> +#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
> +void memcpy_writethru(void *dst, const void *src, size_t cnt);
> +#endif

This should be named memcpy_wt(), which is the well-known postfix for 
write-through.

We already have ioremap_wt(), set_memory_wt(), etc. - no need to introduce a 
longer variant with uncommon spelling.

Thanks,

	Ingo


* [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-04-27  6:30   ` Ingo Molnar
@ 2017-04-28 19:39     ` Dan Williams
  2017-05-05  6:54       ` Ingo Molnar
                         ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dan Williams @ 2017-04-28 19:39 UTC (permalink / raw)
  To: viro
  Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, hch, linux-block,
	linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
	Thomas Gleixner, ross.zwisler

The pmem driver needs to transfer data to a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver, which will turn around and fence previous
writes with an "sfence".

Implement __copy_from_user_inatomic_wt(), memcpy_page_wt(), and
memcpy_wt(), which guarantee that the destination buffer is not dirty
in the cpu cache on completion. The new copy_from_iter_wt() and
sub-routines will be used to replace the "pmem api"
(include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability
of copy_from_iter_wt() and memcpy_wt() is gated by the
CONFIG_ARCH_HAS_UACCESS_WT config symbol; they fall back to
copy_from_iter_nocache() and plain memcpy() otherwise.

This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since the initial RFC:
* s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
  etc. (Ingo)

 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/string_64.h  |    5 +
 arch/x86/include/asm/uaccess_64.h |   11 +++
 arch/x86/lib/usercopy_64.c        |  128 +++++++++++++++++++++++++++++++++++++
 drivers/acpi/nfit/core.c          |    3 -
 drivers/nvdimm/claim.c            |    2 -
 drivers/nvdimm/pmem.c             |   13 +++-
 drivers/nvdimm/region_devs.c      |    4 +
 include/linux/dax.h               |    3 +
 include/linux/string.h            |    6 ++
 include/linux/uio.h               |   15 ++++
 lib/Kconfig                       |    3 +
 lib/iov_iter.c                    |   21 ++++++
 13 files changed, 208 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..398117923b1c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
 	select ARCH_HAS_PMEM_API		if X86_64
+	select ARCH_HAS_UACCESS_WT		if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..dfbd66b11c72 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+#define __HAVE_ARCH_MEMCPY_WT 1
+void memcpy_wt(void *dst, const void *src, size_t cnt);
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..07ded30c7e89 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
+extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
+			   size_t len);
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
@@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+static inline int
+__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
+{
+	kasan_check_write(dst, size);
+	return __copy_user_wt(dst, src, size);
+}
+
 unsigned long
 copy_user_handle_tail(char *to, char *from, unsigned len);
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..0aeff66a022f 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
  */
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/highmem.h>
 
 /*
  * Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 	clac();
 	return len;
 }
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @addr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vend = addr + size;
+	void *p;
+
+	for (p = (void *)((unsigned long)addr & ~clflush_mask);
+	     p < vend; p += x86_clflush_size)
+		clwb(p);
+}
+
+long __copy_user_wt(void *dst, const void __user *src, unsigned size)
+{
+	unsigned long flushed, dest = (unsigned long) dst;
+	long rc = __copy_user_nocache(dst, src, size, 0);
+
+	/*
+	 * __copy_user_nocache() uses non-temporal stores for the bulk
+	 * of the transfer, but we need to manually flush if the
+	 * transfer is unaligned. A cached memory copy is used when
+	 * destination or size is not naturally aligned. That is:
+	 *   - Require 8-byte alignment when size is 8 bytes or larger.
+	 *   - Require 4-byte alignment when size is 4 bytes.
+	 */
+	if (size < 8) {
+		if (!IS_ALIGNED(dest, 4) || size != 4)
+			clean_cache_range(dst, 1);
+	} else {
+		if (!IS_ALIGNED(dest, 8)) {
+			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+			clean_cache_range(dst, 1);
+		}
+
+		flushed = dest - (unsigned long) dst;
+		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+			clean_cache_range(dst + size - 1, 1);
+	}
+
+	return rc;
+}
+
+void memcpy_wt(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	/* cache copy and flush to align dest */
+	if (!IS_ALIGNED(dest, 8)) {
+		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+		memcpy((void *) dest, (void *) source, len);
+		clean_cache_range((void *) dest, len);
+		dest += len;
+		source += len;
+		size -= len;
+		if (!size)
+			return;
+	}
+
+	/* 4x8 movnti loop */
+	while (size >= 32) {
+		asm("movq    (%0), %%r8\n"
+		    "movq   8(%0), %%r9\n"
+		    "movq  16(%0), %%r10\n"
+		    "movq  24(%0), %%r11\n"
+		    "movnti  %%r8,   (%1)\n"
+		    "movnti  %%r9,  8(%1)\n"
+		    "movnti %%r10, 16(%1)\n"
+		    "movnti %%r11, 24(%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8", "r9", "r10", "r11");
+		dest += 32;
+		source += 32;
+		size -= 32;
+	}
+
+	/* 1x8 movnti loop */
+	while (size >= 8) {
+		asm("movq    (%0), %%r8\n"
+		    "movnti  %%r8,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 8;
+		source += 8;
+		size -= 8;
+	}
+
+	/* 1x4 movnti loop */
+	while (size >= 4) {
+		asm("movl    (%0), %%r8d\n"
+		    "movnti  %%r8d,   (%1)\n"
+		    :: "r" (source), "r" (dest)
+		    : "memory", "r8");
+		dest += 4;
+		source += 4;
+		size -= 4;
+	}
+
+	/* cache copy for remaining bytes */
+	if (size) {
+		memcpy((void *) dest, (void *) source, size);
+		clean_cache_range((void *) dest, size);
+	}
+}
+EXPORT_SYMBOL_GPL(memcpy_wt);
+
+void memcpy_page_wt(char *to, struct page *page, size_t offset,
+		size_t len)
+{
+	char *from = kmap_atomic(page);
+
+	memcpy_wt(to, from + offset, len);
+	kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..be9bba609f26 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,8 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 		}
 
 		if (rw)
-			memcpy_to_pmem(mmio->addr.aperture + offset,
-					iobuf + copied, c);
+			memcpy_wt(mmio->addr.aperture + offset, iobuf + copied, c);
 		else {
 			if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
 				mmio_flush_range((void __force *)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..864ed42baaf0 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 			rc = -EIO;
 	}
 
-	memcpy_to_pmem(nsio->addr + offset, buf, size);
+	memcpy_wt(nsio->addr + offset, buf, size);
 	nvdimm_flush(to_nd_region(ndns->dev.parent));
 
 	return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..4be8f30de9b3 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
+#include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
 {
 	void *mem = kmap_atomic(page);
 
-	memcpy_to_pmem(pmem_addr, mem + off, len);
+	memcpy_wt(pmem_addr, mem + off, len);
 	kunmap_atomic(mem);
 }
 
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter_wt(addr, bytes, i);
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
+	.copy_from_iter = pmem_copy_from_iter,
 };
 
 static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
 	dev_set_drvdata(dev, pmem);
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
-	if (nvdimm_has_flush(nd_region) < 0)
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WT)
+			|| nvdimm_has_flush(nd_region) < 0)
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..016af2a6694d 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,8 +947,8 @@ void nvdimm_flush(struct nd_region *nd_region)
 	 * The first wmb() is needed to 'sfence' all previous writes
 	 * such that they are architecturally visible for the platform
 	 * buffer flush.  Note that we've already arranged for pmem
-	 * writes to avoid the cache via arch_memcpy_to_pmem().  The
-	 * final wmb() ensures ordering for the NVDIMM flush write.
+	 * writes to avoid the cache via memcpy_wt().  The final wmb()
+	 * ensures ordering for the NVDIMM flush write.
 	 */
 	wmb();
 	for (i = 0; i < nd_region->ndr_mappings; i++)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
 	 */
 	long (*direct_access)(struct dax_device *, pgoff_t, long,
 			void **, pfn_t *);
+	/* copy_from_iter: dax-driver override for default copy_from_iter */
+	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+			struct iov_iter *);
 };
 
 int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..245e0a29b7e5 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
 	return 0;
 }
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_WT
+static inline void memcpy_wt(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+#endif
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..30c43aa371b5 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
 bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/*
+ * Note, users like pmem that depend on copy_from_iter_wt() having
+ * stricter semantics than copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WT) before assuming
+ * that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_wt(void *addr, size_t bytes,
+				       struct iov_iter *i)
+{
+	return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..f0752a7a9001 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
 config ARCH_HAS_PMEM_API
 	bool
 
+config ARCH_HAS_UACCESS_WT
+	bool
+
 config ARCH_HAS_MMIO_FLUSH
 	bool
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..19ab9af091f9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,27 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter_nocache);
 
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_inatomic_wt((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_page_wt((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy_wt((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_wt);
+#endif
+
 bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;


* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-04-28 19:39     ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
@ 2017-05-05  6:54       ` Ingo Molnar
  2017-05-05 14:12         ` Dan Williams
  2017-05-05 20:39       ` Kani, Toshimitsu
  2017-05-08 20:32       ` Ross Zwisler
  2 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2017-05-05  6:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, ross.zwisler


* Dan Williams <dan.j.williams@intel.com> wrote:

> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
>   etc. (Ingo)

Looks good to me. I suspect you'd like to carry this in the nvdimm tree?

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo


* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-05  6:54       ` Ingo Molnar
@ 2017-05-05 14:12         ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2017-05-05 14:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Al Viro, Jan Kara, Matthew Wilcox, X86 ML,
	linux-kernel@vger.kernel.org, Christoph Hellwig, linux-block,
	linux-nvdimm@lists.01.org, jmoyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, Ross Zwisler

On Thu, May 4, 2017 at 11:54 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Looks good to me. I suspect you'd like to carry this in the nvdimm tree?
>
> Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks, Ingo! Yes, I'll carry it in nvdimm.git for 4.13.


* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-04-28 19:39     ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
  2017-05-05  6:54       ` Ingo Molnar
@ 2017-05-05 20:39       ` Kani, Toshimitsu
  2017-05-05 22:25         ` Dan Williams
  2017-05-08 20:32       ` Ross Zwisler
  2 siblings, 1 reply; 16+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 20:39 UTC (permalink / raw)
  To: dan.j.williams@intel.com, viro@zeniv.linux.org.uk
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de, x86@kernel.org,
	mawilcox@microsoft.com, hpa@zytor.com, linux-nvdimm@lists.01.org,
	mingo@redhat.com, linux-fsdevel@vger.kernel.org,
	ross.zwisler@linux.intel.com, jack@suse.cz

On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(),
> set_memory_wt(), etc. (Ingo)

Sorry I should have said earlier, but I think the term "wt" is
misleading.  Non-temporal stores used in memcpy_wt() provide WC
semantics, not WT semantics.  How about using "nocache" as it's been
used in __copy_user_nocache()?

Thanks,
-Toshi


* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-05 20:39       ` Kani, Toshimitsu
@ 2017-05-05 22:25         ` Dan Williams
  2017-05-05 22:44           ` Kani, Toshimitsu
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-05-05 22:25 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: viro@zeniv.linux.org.uk, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, jmoyer@redhat.com,
	tglx@linutronix.de, hch@lst.de, x86@kernel.org,
	mawilcox@microsoft.com, hpa@zytor.com, linux-nvdimm@lists.01.org,
	mingo@redhat.com, linux-fsdevel@vger.kernel.org,
	ross.zwisler@linux.intel.com, jack@suse.cz

On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(),
>> set_memory_wt(), etc. (Ingo)
>
> Sorry I should have said earlier, but I think the term "wt" is
> misleading.  Non-temporal stores used in memcpy_wt() provide WC
> semantics, not WT semantics.

The non-temporal stores do, but memcpy_wt() is using a combination of
non-temporal stores and explicit cache flushing.

> How about using "nocache" as it's been
> used in __copy_user_nocache()?

The difference in my mind is that the "_nocache" suffix indicates
opportunistic / optional cache pollution avoidance whereas "_wt"
strictly arranges for caches not to contain dirty data upon completion
of the routine. For example, non-temporal stores on older x86 cpus
could potentially leave dirty data in the cache, so memcpy_wt on those
cpus would need to use explicit cache flushing.


* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-05 22:25         ` Dan Williams
@ 2017-05-05 22:44           ` Kani, Toshimitsu
  2017-05-06  2:15             ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 22:44 UTC (permalink / raw)
  To: dan.j.williams@intel.com
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de,
	viro@zeniv.linux.org.uk, x86@kernel.org, mawilcox@microsoft.com,
	hpa@zytor.com, linux-nvdimm@lists.01.org, mingo@redhat.com,
	linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
	jack@suse.cz

On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> The difference in my mind is that the "_nocache" suffix indicates
> opportunistic / optional cache pollution avoidance whereas "_wt"
> strictly arranges for caches not to contain dirty data upon
> completion of the routine. For example, non-temporal stores on older
> x86 cpus could potentially leave dirty data in the cache, so
> memcpy_wt on those cpus would need to use explicit cache flushing.

I see.  I agree that its behavior is different from the existing one
with "_nocache".  That said, I think "wt" or "write-through" generally
means that writes allocate cachelines and keep them clean by writing to
memory.  So, subsequent reads to the destination will hit the
cachelines.  This is not the case with this interface.

Thanks,
-Toshi


* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-05 22:44           ` Kani, Toshimitsu
@ 2017-05-06  2:15             ` Dan Williams
  2017-05-06  3:17               ` Kani, Toshimitsu
  2017-05-06  9:46               ` Ingo Molnar
  0 siblings, 2 replies; 16+ messages in thread
From: Dan Williams @ 2017-05-06  2:15 UTC (permalink / raw)
  To: Kani, Toshimitsu
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de,
	viro@zeniv.linux.org.uk, x86@kernel.org, mawilcox@microsoft.com,
	hpa@zytor.com, linux-nvdimm@lists.01.org, mingo@redhat.com,
	linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
	jack@suse.cz

On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> I see.  I agree that its behavior is different from the existing one
> with "_nocache".   That said, I think "wt" or "write-through" generally
> means that writes allocate cachelines and keep them clean by writing to
> memory.  So, subsequent reads to the destination will hit the
> cachelines.  This is not the case with this interface.

True... maybe _nocache_strict()? Or, leave it _wt() until someone
comes along and is surprised that the cache is not warm for reads
after memcpy_wt(), at which point we can ask "why not just use plain
memcpy then?", or set the page-attributes to WT.


* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-06  2:15             ` Dan Williams
@ 2017-05-06  3:17               ` Kani, Toshimitsu
  2017-05-06  9:46               ` Ingo Molnar
  1 sibling, 0 replies; 16+ messages in thread
From: Kani, Toshimitsu @ 2017-05-06  3:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de,
	viro@zeniv.linux.org.uk, x86@kernel.org, mawilcox@microsoft.com,
	hpa@zytor.com, linux-nvdimm@lists.01.org, mingo@redhat.com,
	linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
	jack@suse.cz

> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

I prefer _nocache_strict(), if it's not too long, since it avoids any
confusion.  If other arches actually implement it with WT semantics,
we might become the one to change it, instead of the caller.

Thanks,
-Toshi


* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-06  2:15             ` Dan Williams
  2017-05-06  3:17               ` Kani, Toshimitsu
@ 2017-05-06  9:46               ` Ingo Molnar
  2017-05-06 13:57                 ` Dan Williams
  1 sibling, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2017-05-06  9:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kani, Toshimitsu, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, jmoyer@redhat.com,
	tglx@linutronix.de, hch@lst.de, viro@zeniv.linux.org.uk,
	x86@kernel.org, mawilcox@microsoft.com, hpa@zytor.com,
	linux-nvdimm@lists.01.org, mingo@redhat.com,
	linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
	jack@suse.cz


* Dan Williams <dan.j.williams@intel.com> wrote:

> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that 
no cache line is left around afterwards (dirty or clean)?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-06  9:46               ` Ingo Molnar
@ 2017-05-06 13:57                 ` Dan Williams
  2017-05-07  8:57                   ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-05-06 13:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kani, Toshimitsu, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, jmoyer@redhat.com,
	tglx@linutronix.de, hch@lst.de, viro@zeniv.linux.org.uk,
	x86@kernel.org, mawilcox@microsoft.com, hpa@zytor.com,
	linux-nvdimm@lists.01.org, mingo@redhat.com,
	linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
	jack@suse.cz

On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> >> wrote:
>> >  :
>> >> > > ---
>> >> > > Changes since the initial RFC:
>> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> >> > > set_memory_wt(), etc. (Ingo)
>> >> >
>> >> > Sorry I should have said earlier, but I think the term "wt" is
>> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
>> >> > semantics, not WT semantics.
>> >>
>> >> The non-temporal stores do, but memcpy_wt() is using a combination of
>> >> non-temporal stores and explicit cache flushing.
>> >>
>> >> > How about using "nocache" as it's been
>> >> > used in __copy_user_nocache()?
>> >>
>> >> The difference in my mind is that the "_nocache" suffix indicates
>> >> opportunistic / optional cache pollution avoidance whereas "_wt"
>> >> strictly arranges for caches not to contain dirty data upon
>> >> completion of the routine. For example, non-temporal stores on older
>> >> x86 cpus could potentially leave dirty data in the cache, so
>> >> memcpy_wt on those cpus would need to use explicit cache flushing.
>> >
>> > I see.  I agree that its behavior is different from the existing one
>> > with "_nocache".   That said, I think "wt" or "write-through" generally
>> > means that writes allocate cachelines and keep them clean by writing to
>> > memory.  So, subsequent reads to the destination will hit the
>> > cachelines.  This is not the case with this interface.
>>
>> True... maybe _nocache_strict()? Or, leave it _wt() until someone
>> comes along and is surprised that the cache is not warm for reads
>> after memcpy_wt(), at which point we can ask "why not just use plain
>> memcpy then?", or set the page-attributes to WT.
>
> Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> no cache line is left around afterwards (dirty or clean)?

Yes, I think "flush" belongs in the name, and to make it easily
grep-able separately from _nocache we can call it _flushcache? An
efficient implementation will use _nocache / non-temporal stores
internally, but external consumers just care about the state of the
cache after the call.

^ permalink raw reply	[flat|nested] 16+ messages in thread
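
As a brief sketch of the caller-visible contract the _flushcache name
describes: after the copy returns, no dirty destination lines are
cached, so a later flush request only needs a store fence to make the
writes durable. memcpy_flushcache() follows the naming proposed above;
pmem_do_write()/pmem_do_flush() are hypothetical driver hooks, and
wmb() is the kernel's existing write barrier (sfence on x86).

	#include <linux/string.h>	/* memcpy_flushcache() */
	#include <asm/barrier.h>	/* wmb() */

	/* write path: data lands with no dirty cache lines left behind */
	static void pmem_do_write(void *pmem_addr, const void *data, size_t len)
	{
		memcpy_flushcache(pmem_addr, data, len);
	}

	/* REQ_FLUSH/REQ_FUA path: fence previously posted stores */
	static void pmem_do_flush(void)
	{
		wmb();
	}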

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-06 13:57                 ` Dan Williams
@ 2017-05-07  8:57                   ` Ingo Molnar
  2017-05-08  3:01                     ` Kani, Toshimitsu
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2017-05-07  8:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kani, Toshimitsu, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, jmoyer@redhat.com,
	tglx@linutronix.de, hch@lst.de, viro@zeniv.linux.org.uk,
	x86@kernel.org, mawilcox@microsoft.com, hpa@zytor.com,
	linux-nvdimm@lists.01.org, mingo@redhat.com,
	linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
	jack@suse.cz


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> >> wrote:
> >> >  :
> >> >> > > ---
> >> >> > > Changes since the initial RFC:
> >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> >> > > set_memory_wt(), etc. (Ingo)
> >> >> >
> >> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> >> > semantics, not WT semantics.
> >> >>
> >> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> >> non-temporal stores and explicit cache flushing.
> >> >>
> >> >> > How about using "nocache" as it's been
> >> >> > used in __copy_user_nocache()?
> >> >>
> >> >> The difference in my mind is that the "_nocache" suffix indicates
> >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> >> strictly arranges for caches not to contain dirty data upon
> >> >> completion of the routine. For example, non-temporal stores on older
> >> >> x86 cpus could potentially leave dirty data in the cache, so
> >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >> >
> >> > I see.  I agree that its behavior is different from the existing one
> >> > with "_nocache".   That said, I think "wt" or "write-through" generally
> >> > means that writes allocate cachelines and keep them clean by writing to
> >> > memory.  So, subsequent reads to the destination will hit the
> >> > cachelines.  This is not the case with this interface.
> >>
> >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> >> comes along and is surprised that the cache is not warm for reads
> >> after memcpy_wt(), at which point we can ask "why not just use plain
> >> memcpy then?", or set the page-attributes to WT.
> >
> > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> > no cache line is left around afterwards (dirty or clean)?
> 
> Yes, I think "flush" belongs in the name, and to make it easily
> > grep-able separately from _nocache we can call it _flushcache? An
> efficient implementation will use _nocache / non-temporal stores
> internally, but external consumers just care about the state of the
> cache after the call.

_flushcache() works for me too.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-07  8:57                   ` Ingo Molnar
@ 2017-05-08  3:01                     ` Kani, Toshimitsu
  0 siblings, 0 replies; 16+ messages in thread
From: Kani, Toshimitsu @ 2017-05-08  3:01 UTC (permalink / raw)
  To: Ingo Molnar, Dan Williams
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de,
	viro@zeniv.linux.org.uk, x86@kernel.org, mawilcox@microsoft.com,
	hpa@zytor.com, linux-nvdimm@lists.01.org, mingo@redhat.com,
	linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
	jack@suse.cz

> * Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > * Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > >> > On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> > >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > >> >  :
> > >> >> > > ---
> > >> >> > > Changes since the initial RFC:
> > >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > >> >> > > set_memory_wt(), etc. (Ingo)
> > >> >> >
> > >> >> > Sorry I should have said earlier, but I think the term "wt" is
> > >> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> > >> >> > semantics, not WT semantics.
> > >> >>
> > >> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> > >> >> non-temporal stores and explicit cache flushing.
> > >> >>
> > >> >> > How about using "nocache" as it's been
> > >> >> > used in __copy_user_nocache()?
> > >> >>
> > >> >> The difference in my mind is that the "_nocache" suffix indicates
> > >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> > >> >> strictly arranges for caches not to contain dirty data upon
> > >> >> completion of the routine. For example, non-temporal stores on older
> > >> >> x86 cpus could potentially leave dirty data in the cache, so
> > >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> > >> >
> > >> > I see.  I agree that its behavior is different from the existing one
> > >> > with "_nocache".   That said, I think "wt" or "write-through" generally
> > >> > means that writes allocate cachelines and keep them clean by writing to
> > >> > memory.  So, subsequent reads to the destination will hit the
> > >> > cachelines.  This is not the case with this interface.
> > >>
> > >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> > >> comes along and is surprised that the cache is not warm for reads
> > >> after memcpy_wt(), at which point we can ask "why not just use plain
> > >> memcpy then?", or set the page-attributes to WT.
> > >
> > > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> > > no cache line is left around afterwards (dirty or clean)?
> >
> > Yes, I think "flush" belongs in the name, and to make it easily
> > > grep-able separately from _nocache we can call it _flushcache? An
> > efficient implementation will use _nocache / non-temporal stores
> > internally, but external consumers just care about the state of the
> > cache after the call.
> 
> _flushcache() works for me too.
> 

Works for me too.
Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-04-28 19:39     ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
  2017-05-05  6:54       ` Ingo Molnar
  2017-05-05 20:39       ` Kani, Toshimitsu
@ 2017-05-08 20:32       ` Ross Zwisler
  2017-05-08 20:40         ` Dan Williams
  2 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2017-05-08 20:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
	linux-block, linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin,
	linux-fsdevel, Thomas Gleixner, ross.zwisler

On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
> 
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
> 
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
> 
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
<>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index c5504b9a472e..07ded30c7e89 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
>  extern long __copy_user_nocache(void *dst, const void __user *src,
>  				unsigned size, int zerorest);
>  
> +extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
> +extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
> +			   size_t len);
> +
>  static inline int
>  __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>  				  unsigned size)
> @@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>  	return __copy_user_nocache(dst, src, size, 0);
>  }
>  
> +static inline int
> +__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
> +{
> +	kasan_check_write(dst, size);
> +	return __copy_user_wt(dst, src, size);
> +}
> +
>  unsigned long
>  copy_user_handle_tail(char *to, char *from, unsigned len);
>  
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 3b7c40a2e3e1..0aeff66a022f 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -7,6 +7,7 @@
>   */
>  #include <linux/export.h>
>  #include <linux/uaccess.h>
> +#include <linux/highmem.h>
>  
>  /*
>   * Zero Userspace
> @@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
>  	clac();
>  	return len;
>  }
> +
> +#ifdef CONFIG_ARCH_HAS_UACCESS_WT
> +/**
> + * clean_cache_range - write back a cache range with CLWB
> + * @addr:	virtual start address
> + * @size:	number of bytes to write back
> + *
> + * Write back a cache range using the CLWB (cache line write back)
> + * instruction. Note that @size is internally rounded up to be cache
> + * line size aligned.
> + */
> +static void clean_cache_range(void *addr, size_t size)
> +{
> +	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
> +	unsigned long clflush_mask = x86_clflush_size - 1;
> +	void *vend = addr + size;
> +	void *p;
> +
> +	for (p = (void *)((unsigned long)addr & ~clflush_mask);
> +	     p < vend; p += x86_clflush_size)
> +		clwb(p);
> +}
> +
> +long __copy_user_wt(void *dst, const void __user *src, unsigned size)
> +{
> +	unsigned long flushed, dest = (unsigned long) dst;
> +	long rc = __copy_user_nocache(dst, src, size, 0);
> +
> +	/*
> +	 * __copy_user_nocache() uses non-temporal stores for the bulk
> +	 * of the transfer, but we need to manually flush if the
> +	 * transfer is unaligned. A cached memory copy is used when
> +	 * destination or size is not naturally aligned. That is:
> +	 *   - Require 8-byte alignment when size is 8 bytes or larger.
> +	 *   - Require 4-byte alignment when size is 4 bytes.
> +	 */
> +	if (size < 8) {
> +		if (!IS_ALIGNED(dest, 4) || size != 4)
> +			clean_cache_range(dst, 1);
> +	} else {
> +		if (!IS_ALIGNED(dest, 8)) {
> +			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> +			clean_cache_range(dst, 1);
> +		}
> +
> +		flushed = dest - (unsigned long) dst;
> +		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
> +			clean_cache_range(dst + size - 1, 1);
> +	}
> +
> +	return rc;
> +}
> +
> +void memcpy_wt(void *_dst, const void *_src, size_t size)
> +{
> +	unsigned long dest = (unsigned long) _dst;
> +	unsigned long source = (unsigned long) _src;
> +
> +	/* cache copy and flush to align dest */
> +	if (!IS_ALIGNED(dest, 8)) {
> +		unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
> +
> +		memcpy((void *) dest, (void *) source, len);
> +		clean_cache_range((void *) dest, len);
> +		dest += len;
> +		source += len;
> +		size -= len;
> +		if (!size)
> +			return;
> +	}
> +
> +	/* 4x8 movnti loop */
> +	while (size >= 32) {
> +		asm("movq    (%0), %%r8\n"
> +		    "movq   8(%0), %%r9\n"
> +		    "movq  16(%0), %%r10\n"
> +		    "movq  24(%0), %%r11\n"
> +		    "movnti  %%r8,   (%1)\n"
> +		    "movnti  %%r9,  8(%1)\n"
> +		    "movnti %%r10, 16(%1)\n"
> +		    "movnti %%r11, 24(%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8", "r9", "r10", "r11");
> +		dest += 32;
> +		source += 32;
> +		size -= 32;
> +	}
> +
> +	/* 1x8 movnti loop */
> +	while (size >= 8) {
> +		asm("movq    (%0), %%r8\n"
> +		    "movnti  %%r8,   (%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8");
> +		dest += 8;
> +		source += 8;
> +		size -= 8;
> +	}
> +
> +	/* 1x4 movnti loop */
> +	while (size >= 4) {
> +		asm("movl    (%0), %%r8d\n"
> +		    "movnti  %%r8d,   (%1)\n"
> +		    :: "r" (source), "r" (dest)
> +		    : "memory", "r8");
> +		dest += 4;
> +		source += 4;
> +		size -= 4;
> +	}
> +
> +	/* cache copy for remaining bytes */
> +	if (size) {
> +		memcpy((void *) dest, (void *) source, size);
> +		clean_cache_range((void *) dest, size);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(memcpy_wt);

I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
they look correct to me.  The inline assembly for non-temporal copies mixed
with C for loop control is IMHO much easier to follow than the pure assembly
of __copy_user_nocache().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread
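
The clean_cache_range() rounding that the review above calls out is
easiest to verify with concrete numbers. A standalone sketch, with
printf() standing in for clwb and 64 assumed in place of
boot_cpu_data.x86_clflush_size:

	#include <stdint.h>
	#include <stdio.h>

	/* round the start down to a line boundary, then step one cache
	 * line at a time until the whole [addr, addr + size) range is
	 * covered -- the same arithmetic as clean_cache_range() */
	static void demo_clean_cache_range(uintptr_t addr, uintptr_t size)
	{
		const uintptr_t line = 64;
		uintptr_t end = addr + size;
		uintptr_t p;

		for (p = addr & ~(line - 1); p < end; p += line)
			printf("  clwb line at %#lx\n", (unsigned long)p);
	}

	int main(void)
	{
		/* a deliberately unaligned 130-byte range */
		printf("flush [0x1003, 0x1085):\n");
		demo_clean_cache_range(0x1003, 130);
		/* emits 0x1000, 0x1040, 0x1080: three lines cover it */
		return 0;
	}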

* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
  2017-05-08 20:32       ` Ross Zwisler
@ 2017-05-08 20:40         ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2017-05-08 20:40 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Al Viro, Jan Kara, Matthew Wilcox,
	X86 ML, linux-kernel@vger.kernel.org, Christoph Hellwig,
	linux-block, linux-nvdimm@lists.01.org, jmoyer, Ingo Molnar,
	H. Peter Anvin, linux-fsdevel, Thomas Gleixner

On Mon, May 8, 2017 at 1:32 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[..]
> I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
> they look correct to me.  The inline assembly for non-temporal copies mixed
> with C for loop control is IMHO much easier to follow than the pure assembly
> of __copy_user_nocache().
>
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks Ross, I appreciate it.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-05-08 20:40 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20170425012230.GX29622@ZenIV.linux.org.uk>
2017-04-26 21:56 ` [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem Dan Williams
2017-04-27  6:30   ` Ingo Molnar
2017-04-28 19:39     ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
2017-05-05  6:54       ` Ingo Molnar
2017-05-05 14:12         ` Dan Williams
2017-05-05 20:39       ` Kani, Toshimitsu
2017-05-05 22:25         ` Dan Williams
2017-05-05 22:44           ` Kani, Toshimitsu
2017-05-06  2:15             ` Dan Williams
2017-05-06  3:17               ` Kani, Toshimitsu
2017-05-06  9:46               ` Ingo Molnar
2017-05-06 13:57                 ` Dan Williams
2017-05-07  8:57                   ` Ingo Molnar
2017-05-08  3:01                     ` Kani, Toshimitsu
2017-05-08 20:32       ` Ross Zwisler
2017-05-08 20:40         ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).