* [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
[not found] <20170425012230.GX29622@ZenIV.linux.org.uk>
@ 2017-04-26 21:56 ` Dan Williams
2017-04-27 6:30 ` Ingo Molnar
0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-04-26 21:56 UTC (permalink / raw)
To: viro
Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, hch, linux-block,
linux-nvdimm, Jeff Moyer, Ingo Molnar, H. Peter Anvin,
linux-fsdevel, Thomas Gleixner, Ross Zwisler
The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".
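
To make that ordering concrete, a rough sketch of the sequence
(illustrative only; "flush_hint" is a stand-in name for the
platform-specific write-posted-queue flush address that
nvdimm_flush() writes to):

	/* data path: cache-bypassing copy to persistent memory */
	memcpy_writethru(pmem_addr, src, len);
	/* ... fsync() arrives as REQ_FLUSH / REQ_FUA ... */
	wmb();			/* sfence: fence all prior movnt stores */
	writeq(1, flush_hint);	/* platform-specific flush-hint write */
	wmb();			/* order the flush write itself */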
Implement a __copy_from_user_inatomic_writethru, memcpy_page_writethru,
and memcpy_writethru, that guarantee that the destination buffer is not
dirty in the cpu cache on completion. The new copy_from_iter_writethru
and sub-routines will be used to replace the "pmem api"
(include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability
of copy_from_iter_writethru() and memcpy_writethru() are gated by the
CONFIG_ARCH_HAS_UACCESS_WRITETHRU config symbol, and fallback to
copy_from_iter_nocache() and plain memcpy() otherwise.
This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
This patch is based on a merge of vfs.git/for-next and
nvdimm.git/libnvdimm-for-next.
arch/x86/Kconfig | 1
arch/x86/include/asm/string_64.h | 5 +
arch/x86/include/asm/uaccess_64.h | 13 ++++
arch/x86/lib/usercopy_64.c | 128 +++++++++++++++++++++++++++++++++++++
drivers/acpi/nfit/core.c | 2 -
drivers/nvdimm/claim.c | 2 -
drivers/nvdimm/pmem.c | 13 +++-
drivers/nvdimm/region_devs.c | 2 -
include/linux/dax.h | 3 +
include/linux/string.h | 6 ++
include/linux/uio.h | 15 ++++
lib/Kconfig | 3 +
lib/iov_iter.c | 22 ++++++
13 files changed, 210 insertions(+), 5 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..bd3ff407d707 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_PMEM_API if X86_64
+ select ARCH_HAS_UACCESS_WRITETHRU if X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..60173bc51603 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
return 0;
}
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
+void memcpy_writethru(void *dst, const void *src, size_t cnt);
+#endif
+
#endif /* __KERNEL__ */
#endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..748e8a50e4b3 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,11 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
extern long __copy_user_nocache(void *dst, const void __user *src,
unsigned size, int zerorest);
+extern long __copy_user_writethru(void *dst, const void __user *src,
+ unsigned size);
+extern void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+ size_t len);
+
static inline int
__copy_from_user_inatomic_nocache(void *dst, const void __user *src,
unsigned size)
@@ -179,6 +184,14 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
return __copy_user_nocache(dst, src, size, 0);
}
+static inline int
+__copy_from_user_inatomic_writethru(void *dst, const void __user *src,
+ unsigned size)
+{
+ kasan_check_write(dst, size);
+ return __copy_user_writethru(dst, src, size);
+}
+
unsigned long
copy_user_handle_tail(char *to, char *from, unsigned len);
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..144cb5e59193 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
*/
#include <linux/export.h>
#include <linux/uaccess.h>
+#include <linux/highmem.h>
/*
* Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
clac();
return len;
}
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @addr: virtual start address
+ * @size: number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+ u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+ unsigned long clflush_mask = x86_clflush_size - 1;
+ void *vend = addr + size;
+ void *p;
+
+ for (p = (void *)((unsigned long)addr & ~clflush_mask);
+ p < vend; p += x86_clflush_size)
+ clwb(p);
+}
+
+long __copy_user_writethru(void *dst, const void __user *src, unsigned size)
+{
+ unsigned long flushed, dest = (unsigned long) dst;
+ long rc = __copy_user_nocache(dst, src, size, 0);
+
+ /*
+ * __copy_user_nocache() uses non-temporal stores for the bulk
+ * of the transfer, but we need to manually flush if the
+ * transfer is unaligned. A cached memory copy is used when
+ * destination or size is not naturally aligned. That is:
+ * - Require 8-byte alignment when size is 8 bytes or larger.
+ * - Require 4-byte alignment when size is 4 bytes.
+ */
+ if (size < 8) {
+ if (!IS_ALIGNED(dest, 4) || size != 4)
+ clean_cache_range(dst, size);
+ } else {
+ if (!IS_ALIGNED(dest, 8)) {
+ dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+ clean_cache_range(dst, 1);
+ }
+
+ flushed = dest - (unsigned long) dst;
+ if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+ clean_cache_range(dst + size - 1, 1);
+ }
+
+ return rc;
+}
+
+void memcpy_writethru(void *_dst, const void *_src, size_t size)
+{
+ unsigned long dest = (unsigned long) _dst;
+ unsigned long source = (unsigned long) _src;
+
+ /* cache copy and flush to align dest */
+ if (!IS_ALIGNED(dest, 8)) {
+ unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+ memcpy((void *) dest, (void *) source, len);
+ clean_cache_range((void *) dest, len);
+ dest += len;
+ source += len;
+ size -= len;
+ if (!size)
+ return;
+ }
+
+ /* 4x8 movnti loop */
+ while (size >= 32) {
+ asm("movq (%0), %%r8\n"
+ "movq 8(%0), %%r9\n"
+ "movq 16(%0), %%r10\n"
+ "movq 24(%0), %%r11\n"
+ "movnti %%r8, (%1)\n"
+ "movnti %%r9, 8(%1)\n"
+ "movnti %%r10, 16(%1)\n"
+ "movnti %%r11, 24(%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8", "r9", "r10", "r11");
+ dest += 32;
+ source += 32;
+ size -= 32;
+ }
+
+ /* 1x8 movnti loop */
+ while (size >= 8) {
+ asm("movq (%0), %%r8\n"
+ "movnti %%r8, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 8;
+ source += 8;
+ size -= 8;
+ }
+
+ /* 1x4 movnti loop */
+ while (size >= 4) {
+ asm("movl (%0), %%r8d\n"
+ "movnti %%r8d, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 4;
+ source += 4;
+ size -= 4;
+ }
+
+ /* cache copy for remaining bytes */
+ if (size) {
+ memcpy((void *) dest, (void *) source, size);
+ clean_cache_range((void *) dest, size);
+ }
+}
+EXPORT_SYMBOL_GPL(memcpy_writethru);
+
+void memcpy_page_writethru(char *to, struct page *page, size_t offset,
+ size_t len)
+{
+ char *from = kmap_atomic(page);
+
+ memcpy_writethru(to, from + offset, len);
+ kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..c84e242f91ed 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,7 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
}
if (rw)
- memcpy_to_pmem(mmio->addr.aperture + offset,
+ memcpy_writethru(mmio->addr.aperture + offset,
iobuf + copied, c);
else {
if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..38822f6fa49f 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
rc = -EIO;
}
- memcpy_to_pmem(nsio->addr + offset, buf, size);
+ memcpy_writethru(nsio->addr + offset, buf, size);
nvdimm_flush(to_nd_region(ndns->dev.parent));
return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..28dc82a595a5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
#include <linux/pfn_t.h>
#include <linux/slab.h>
#include <linux/pmem.h>
+#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/nd.h>
#include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
{
void *mem = kmap_atomic(page);
- memcpy_to_pmem(pmem_addr, mem + off, len);
+ memcpy_writethru(pmem_addr, mem + off, len);
kunmap_atomic(mem);
}
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
}
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ return copy_from_iter_writethru(addr, bytes, i);
+}
+
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
+ .copy_from_iter = pmem_copy_from_iter,
};
static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (nvdimm_has_flush(nd_region) < 0)
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHRU)
+ || nvdimm_has_flush(nd_region) < 0)
dev_warn(dev, "unable to guarantee persistence of writes\n");
if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..b668ba455c39 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,7 +947,7 @@ void nvdimm_flush(struct nd_region *nd_region)
* The first wmb() is needed to 'sfence' all previous writes
* such that they are architecturally visible for the platform
* buffer flush. Note that we've already arranged for pmem
- * writes to avoid the cache via arch_memcpy_to_pmem(). The
+ * writes to avoid the cache via memcpy_writethru(). The
* final wmb() ensures ordering for the NVDIMM flush write.
*/
wmb();
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
*/
long (*direct_access)(struct dax_device *, pgoff_t, long,
void **, pfn_t *);
+ /* copy_from_iter: dax-driver override for default copy_from_iter */
+ size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+ struct iov_iter *);
};
int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..f4e166d88e2a 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
return 0;
}
#endif
+#ifndef __HAVE_ARCH_MEMCPY_WRITETHRU
+static inline void memcpy_writethru(void *dst, const void *src, size_t cnt)
+{
+ memcpy(dst, src, cnt);
+}
+#endif
void *memchr_inv(const void *s, int c, size_t n);
char *strreplace(char *s, char old, char new);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..d284cb5e89fa 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_writethru() relative to copy_from_iter_nocache() must
+ * check for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WRITETHRU) before
+ * assuming that the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_writethru(void *addr, size_t bytes,
+ struct iov_iter *i)
+{
+ return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..db31bc186df2 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
config ARCH_HAS_PMEM_API
bool
+config ARCH_HAS_UACCESS_WRITETHRU
+ bool
+
config ARCH_HAS_MMIO_FLUSH
bool
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..afc3dc75346c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,28 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
}
EXPORT_SYMBOL(copy_from_iter_nocache);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
+size_t copy_from_iter_writethru(void *addr, size_t bytes, struct iov_iter *i)
+{
+ char *to = addr;
+ if (unlikely(i->type & ITER_PIPE)) {
+ WARN_ON(1);
+ return 0;
+ }
+ iterate_and_advance(i, bytes, v,
+ __copy_from_user_inatomic_writethru((to += v.iov_len) - v.iov_len,
+ v.iov_base, v.iov_len),
+ memcpy_page_writethru((to += v.bv_len) - v.bv_len, v.bv_page,
+ v.bv_offset, v.bv_len),
+ memcpy_writethru((to += v.iov_len) - v.iov_len, v.iov_base,
+ v.iov_len)
+ )
+
+ return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_writethru);
+#endif
+
bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
{
char *to = addr;
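
As a usage illustration (not part of this patch): the dax core can
route writes through the driver's copy_from_iter override and fall
back to the nocache copy when no override is registered. The
dax_copy_from_iter() helper and the ops dereference below are assumed
names for this sketch, not interfaces the patch defines:

	static size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
			void *addr, size_t bytes, struct iov_iter *i)
	{
		/* dispatch to e.g. pmem_copy_from_iter() when registered */
		if (dax_dev->ops->copy_from_iter)
			return dax_dev->ops->copy_from_iter(dax_dev, pgoff,
					addr, bytes, i);
		return copy_from_iter_nocache(addr, bytes, i);
	}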
* Re: [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem
2017-04-26 21:56 ` [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem Dan Williams
@ 2017-04-27 6:30 ` Ingo Molnar
2017-04-28 19:39 ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2017-04-27 6:30 UTC (permalink / raw)
To: Dan Williams
Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
linux-block, linux-nvdimm, Jeff Moyer, Ingo Molnar,
H. Peter Anvin, linux-fsdevel, Thomas Gleixner, Ross Zwisler
* Dan Williams <dan.j.williams@intel.com> wrote:
> +#ifdef CONFIG_ARCH_HAS_UACCESS_WRITETHRU
> +#define __HAVE_ARCH_MEMCPY_WRITETHRU 1
> +void memcpy_writethru(void *dst, const void *src, size_t cnt);
> +#endif
This should be named memcpy_wt(), which is the well-known postfix for
write-through.
We already have ioremap_wt(), set_memory_wt(), etc. - no need to introduce a
longer variant with uncommon spelling.
Thanks,
Ingo
* [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-04-27 6:30 ` Ingo Molnar
@ 2017-04-28 19:39 ` Dan Williams
2017-05-05 6:54 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Dan Williams @ 2017-04-28 19:39 UTC (permalink / raw)
To: viro
Cc: Jan Kara, Matthew Wilcox, x86, linux-kernel, hch, linux-block,
linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin, linux-fsdevel,
Thomas Gleixner, ross.zwisler
The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes
are not cached. It is sufficient for the writes to be flushed to a
cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
userspace to call fsync() to ensure data-writes have reached a
power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
REQ_FLUSH to the pmem driver which will turn around and fence previous
writes with an "sfence".
Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
that guarantee that the destination buffer is not dirty in the cpu cache
on completion. The new copy_from_iter_wt and sub-routines will be used
to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.
This is meant to satisfy the concern from Linus that if a driver wants
to do something beyond the normal nocache semantics it should be
something private to that driver [1], and Al's concern that anything
uaccess related belongs with the rest of the uaccess code [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since the initial RFC:
* s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
etc. (Ingo)
arch/x86/Kconfig | 1
arch/x86/include/asm/string_64.h | 5 +
arch/x86/include/asm/uaccess_64.h | 11 +++
arch/x86/lib/usercopy_64.c | 128 +++++++++++++++++++++++++++++++++++++
drivers/acpi/nfit/core.c | 3 -
drivers/nvdimm/claim.c | 2 -
drivers/nvdimm/pmem.c | 13 +++-
drivers/nvdimm/region_devs.c | 4 +
include/linux/dax.h | 3 +
include/linux/string.h | 6 ++
include/linux/uio.h | 15 ++++
lib/Kconfig | 3 +
lib/iov_iter.c | 21 ++++++
13 files changed, 208 insertions(+), 7 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d50fdff77ee..398117923b1c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_PMEM_API if X86_64
+ select ARCH_HAS_UACCESS_WT if X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..dfbd66b11c72 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
return 0;
}
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+#define __HAVE_ARCH_MEMCPY_WT 1
+void memcpy_wt(void *dst, const void *src, size_t cnt);
+#endif
+
#endif /* __KERNEL__ */
#endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..07ded30c7e89 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
extern long __copy_user_nocache(void *dst, const void __user *src,
unsigned size, int zerorest);
+extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
+extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
+ size_t len);
+
static inline int
__copy_from_user_inatomic_nocache(void *dst, const void __user *src,
unsigned size)
@@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
return __copy_user_nocache(dst, src, size, 0);
}
+static inline int
+__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
+{
+ kasan_check_write(dst, size);
+ return __copy_user_wt(dst, src, size);
+}
+
unsigned long
copy_user_handle_tail(char *to, char *from, unsigned len);
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..0aeff66a022f 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
*/
#include <linux/export.h>
#include <linux/uaccess.h>
+#include <linux/highmem.h>
/*
* Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
clac();
return len;
}
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @addr: virtual start address
+ * @size: number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+ u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+ unsigned long clflush_mask = x86_clflush_size - 1;
+ void *vend = addr + size;
+ void *p;
+
+ for (p = (void *)((unsigned long)addr & ~clflush_mask);
+ p < vend; p += x86_clflush_size)
+ clwb(p);
+}
+
+long __copy_user_wt(void *dst, const void __user *src, unsigned size)
+{
+ unsigned long flushed, dest = (unsigned long) dst;
+ long rc = __copy_user_nocache(dst, src, size, 0);
+
+ /*
+ * __copy_user_nocache() uses non-temporal stores for the bulk
+ * of the transfer, but we need to manually flush if the
+ * transfer is unaligned. A cached memory copy is used when
+ * destination or size is not naturally aligned. That is:
+ * - Require 8-byte alignment when size is 8 bytes or larger.
+ * - Require 4-byte alignment when size is 4 bytes.
+ */
+ if (size < 8) {
+ if (!IS_ALIGNED(dest, 4) || size != 4)
+ clean_cache_range(dst, size);
+ } else {
+ if (!IS_ALIGNED(dest, 8)) {
+ dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+ clean_cache_range(dst, 1);
+ }
+
+ flushed = dest - (unsigned long) dst;
+ if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+ clean_cache_range(dst + size - 1, 1);
+ }
+
+ return rc;
+}
+
+void memcpy_wt(void *_dst, const void *_src, size_t size)
+{
+ unsigned long dest = (unsigned long) _dst;
+ unsigned long source = (unsigned long) _src;
+
+ /* cache copy and flush to align dest */
+ if (!IS_ALIGNED(dest, 8)) {
+ unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+ memcpy((void *) dest, (void *) source, len);
+ clean_cache_range((void *) dest, len);
+ dest += len;
+ source += len;
+ size -= len;
+ if (!size)
+ return;
+ }
+
+ /* 4x8 movnti loop */
+ while (size >= 32) {
+ asm("movq (%0), %%r8\n"
+ "movq 8(%0), %%r9\n"
+ "movq 16(%0), %%r10\n"
+ "movq 24(%0), %%r11\n"
+ "movnti %%r8, (%1)\n"
+ "movnti %%r9, 8(%1)\n"
+ "movnti %%r10, 16(%1)\n"
+ "movnti %%r11, 24(%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8", "r9", "r10", "r11");
+ dest += 32;
+ source += 32;
+ size -= 32;
+ }
+
+ /* 1x8 movnti loop */
+ while (size >= 8) {
+ asm("movq (%0), %%r8\n"
+ "movnti %%r8, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 8;
+ source += 8;
+ size -= 8;
+ }
+
+ /* 1x4 movnti loop */
+ while (size >= 4) {
+ asm("movl (%0), %%r8d\n"
+ "movnti %%r8d, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 4;
+ source += 4;
+ size -= 4;
+ }
+
+ /* cache copy for remaining bytes */
+ if (size) {
+ memcpy((void *) dest, (void *) source, size);
+ clean_cache_range((void *) dest, size);
+ }
+}
+EXPORT_SYMBOL_GPL(memcpy_wt);
+
+void memcpy_page_wt(char *to, struct page *page, size_t offset,
+ size_t len)
+{
+ char *from = kmap_atomic(page);
+
+ memcpy_wt(to, from + offset, len);
+ kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index d0c07b2344e4..be9bba609f26 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1776,8 +1776,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
}
if (rw)
- memcpy_to_pmem(mmio->addr.aperture + offset,
- iobuf + copied, c);
+ memcpy_wt(mmio->addr.aperture + offset, iobuf + copied, c);
else {
if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
mmio_flush_range((void __force *)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 3a35e8028b9c..864ed42baaf0 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -266,7 +266,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
rc = -EIO;
}
- memcpy_to_pmem(nsio->addr + offset, buf, size);
+ memcpy_wt(nsio->addr + offset, buf, size);
nvdimm_flush(to_nd_region(ndns->dev.parent));
return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..4be8f30de9b3 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,6 +28,7 @@
#include <linux/pfn_t.h>
#include <linux/slab.h>
#include <linux/pmem.h>
+#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/nd.h>
#include "pmem.h"
@@ -79,7 +80,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
{
void *mem = kmap_atomic(page);
- memcpy_to_pmem(pmem_addr, mem + off, len);
+ memcpy_wt(pmem_addr, mem + off, len);
kunmap_atomic(mem);
}
@@ -234,8 +235,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
}
+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ return copy_from_iter_wt(addr, bytes, i);
+}
+
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
+ .copy_from_iter = pmem_copy_from_iter,
};
static void pmem_release_queue(void *q)
@@ -288,7 +296,8 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (nvdimm_has_flush(nd_region) < 0)
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WT)
+ || nvdimm_has_flush(nd_region) < 0)
dev_warn(dev, "unable to guarantee persistence of writes\n");
if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb5066d961..016af2a6694d 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -947,8 +947,8 @@ void nvdimm_flush(struct nd_region *nd_region)
* The first wmb() is needed to 'sfence' all previous writes
* such that they are architecturally visible for the platform
* buffer flush. Note that we've already arranged for pmem
- * writes to avoid the cache via arch_memcpy_to_pmem(). The
- * final wmb() ensures ordering for the NVDIMM flush write.
+ * writes to avoid the cache via memcpy_wt(). The final wmb()
+ * ensures ordering for the NVDIMM flush write.
*/
wmb();
for (i = 0; i < nd_region->ndr_mappings; i++)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..156f067d4db5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
*/
long (*direct_access)(struct dax_device *, pgoff_t, long,
void **, pfn_t *);
+ /* copy_from_iter: dax-driver override for default copy_from_iter */
+ size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+ struct iov_iter *);
};
int dax_read_lock(void);
diff --git a/include/linux/string.h b/include/linux/string.h
index 9d6f189157e2..245e0a29b7e5 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
return 0;
}
#endif
+#ifndef __HAVE_ARCH_MEMCPY_WT
+static inline void memcpy_wt(void *dst, const void *src, size_t cnt)
+{
+ memcpy(dst, src, cnt);
+}
+#endif
void *memchr_inv(const void *s, int c, size_t n);
char *strreplace(char *s, char old, char new);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..30c43aa371b5 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_wt() relative to copy_from_iter_nocache() must check
+ * for IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_WT) before assuming that
+ * the destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_wt(void *addr, size_t bytes,
+ struct iov_iter *i)
+{
+ return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..f0752a7a9001 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
config ARCH_HAS_PMEM_API
bool
+config ARCH_HAS_UACCESS_WT
+ bool
+
config ARCH_HAS_MMIO_FLUSH
bool
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..19ab9af091f9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,27 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
}
EXPORT_SYMBOL(copy_from_iter_nocache);
+#ifdef CONFIG_ARCH_HAS_UACCESS_WT
+size_t copy_from_iter_wt(void *addr, size_t bytes, struct iov_iter *i)
+{
+ char *to = addr;
+ if (unlikely(i->type & ITER_PIPE)) {
+ WARN_ON(1);
+ return 0;
+ }
+ iterate_and_advance(i, bytes, v,
+ __copy_from_user_inatomic_wt((to += v.iov_len) - v.iov_len,
+ v.iov_base, v.iov_len),
+ memcpy_page_wt((to += v.bv_len) - v.bv_len, v.bv_page,
+ v.bv_offset, v.bv_len),
+ memcpy_wt((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+ )
+
+ return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_wt);
+#endif
+
bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
{
char *to = addr;
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-04-28 19:39 ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
@ 2017-05-05 6:54 ` Ingo Molnar
2017-05-05 14:12 ` Dan Williams
2017-05-05 20:39 ` Kani, Toshimitsu
2017-05-08 20:32 ` Ross Zwisler
2 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2017-05-05 6:54 UTC (permalink / raw)
To: Dan Williams
Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
linux-block, linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin,
linux-fsdevel, Thomas Gleixner, ross.zwisler
* Dan Williams <dan.j.williams@intel.com> wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
>
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
>
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
> etc. (Ingo)
Looks good to me. I suspect you'd like to carry this in the nvdimm tree?
Acked-by: Ingo Molnar <mingo@kernel.org>
Thanks,
Ingo
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-05 6:54 ` Ingo Molnar
@ 2017-05-05 14:12 ` Dan Williams
0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2017-05-05 14:12 UTC (permalink / raw)
To: Ingo Molnar
Cc: Al Viro, Jan Kara, Matthew Wilcox, X86 ML,
linux-kernel@vger.kernel.org, Christoph Hellwig, linux-block,
linux-nvdimm@lists.01.org, jmoyer, Ingo Molnar, H. Peter Anvin,
linux-fsdevel, Thomas Gleixner, Ross Zwisler
On Thu, May 4, 2017 at 11:54 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(), set_memory_wt(),
>> etc. (Ingo)
>
> Looks good to me. I suspect you'd like to carry this in the nvdimm tree?
>
> Acked-by: Ingo Molnar <mingo@kernel.org>
Thanks, Ingo! Yes, I'll carry it in nvdimm.git for 4.13.
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-04-28 19:39 ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
2017-05-05 6:54 ` Ingo Molnar
@ 2017-05-05 20:39 ` Kani, Toshimitsu
2017-05-05 22:25 ` Dan Williams
2017-05-08 20:32 ` Ross Zwisler
2 siblings, 1 reply; 16+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 20:39 UTC (permalink / raw)
To: dan.j.williams@intel.com, viro@zeniv.linux.org.uk
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de, x86@kernel.org,
mawilcox@microsoft.com, hpa@zytor.com, linux-nvdimm@lists.01.org,
mingo@redhat.com, linux-fsdevel@vger.kernel.org,
ross.zwisler@linux.intel.com, jack@suse.cz
On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination
> writes are not cached. It is sufficient for the writes to be flushed
> to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we
> expect userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA
> or REQ_FLUSH to the pmem driver which will turn around and fence
> previous writes with an "sfence".
>
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and
> memcpy_wt, that guarantee that the destination buffer is not dirty in
> the cpu cache on completion. The new copy_from_iter_wt and sub-
> routines will be used to replace the "pmem api" (include/linux/pmem.h
> + arch/x86/include/asm/pmem.h). The availability of
> copy_from_iter_wt() and memcpy_wt() are gated by the
> CONFIG_ARCH_HAS_UACCESS_WT config symbol, and fallback to
> copy_from_iter_nocache() and plain memcpy() otherwise.
>
> This is meant to satisfy the concern from Linus that if a driver
> wants to do something beyond the normal nocache semantics it should
> be something private to that driver [1], and Al's concern that
> anything uaccess related belongs with the rest of the uaccess code
> [2].
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since the initial RFC:
> * s/writethru/wt/ since we already have ioremap_wt(),
> set_memory_wt(), etc. (Ingo)

Sorry I should have said earlier, but I think the term "wt" is
misleading.  Non-temporal stores used in memcpy_wt() provide WC
semantics, not WT semantics.  How about using "nocache" as it's been
used in __copy_user_nocache()?

Thanks,
-Toshi
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-05 20:39 ` Kani, Toshimitsu
@ 2017-05-05 22:25 ` Dan Williams
2017-05-05 22:44 ` Kani, Toshimitsu
0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-05-05 22:25 UTC (permalink / raw)
To: Kani, Toshimitsu
Cc: viro@zeniv.linux.org.uk, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, jmoyer@redhat.com,
tglx@linutronix.de, hch@lst.de, x86@kernel.org,
mawilcox@microsoft.com, hpa@zytor.com, linux-nvdimm@lists.01.org,
mingo@redhat.com, linux-fsdevel@vger.kernel.org,
ross.zwisler@linux.intel.com, jack@suse.cz
On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-04-28 at 12:39 -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination
>> writes are not cached. It is sufficient for the writes to be flushed
>> to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we
>> expect userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA
>> or REQ_FLUSH to the pmem driver which will turn around and fence
>> previous writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and
>> memcpy_wt, that guarantee that the destination buffer is not dirty in
>> the cpu cache on completion. The new copy_from_iter_wt and sub-
>> routines will be used to replace the "pmem api" (include/linux/pmem.h
>> + arch/x86/include/asm/pmem.h). The availability of
>> copy_from_iter_wt() and memcpy_wt() are gated by the
>> CONFIG_ARCH_HAS_UACCESS_WT config symbol, and fallback to
>> copy_from_iter_nocache() and plain memcpy() otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver
>> wants to do something beyond the normal nocache semantics it should
>> be something private to that driver [1], and Al's concern that
>> anything uaccess related belongs with the rest of the uaccess code
>> [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.
>> html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.ht
>> ml
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Changes since the initial RFC:
>> * s/writethru/wt/ since we already have ioremap_wt(),
>> set_memory_wt(), etc. (Ingo)
>
> Sorry I should have said earlier, but I think the term "wt" is
> misleading. Non-temporal stores used in memcpy_wt() provide WC
> semantics, not WT semantics.
The non-temporal stores do, but memcpy_wt() is using a combination of
non-temporal stores and explicit cache flushing.
> How about using "nocache" as it's been
> used in __copy_user_nocache()?
The difference in my mind is that the "_nocache" suffix indicates
opportunistic / optional cache pollution avoidance whereas "_wt"
strictly arranges for caches not to contain dirty data upon completion
of the routine. For example, non-temporal stores on older x86 cpus
could potentially leave dirty data in the cache, so memcpy_wt on those
cpus would need to use explicit cache flushing.
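
For such CPUs a strict fallback could look like this hypothetical
sketch (not in the patch; clean_cache_range() is the CLWB helper the
patch adds):

	/* Guarantee no dirty destination lines even when non-temporal
	 * stores might still allocate into the cache. */
	static void memcpy_wt_slow(void *dst, const void *src, size_t size)
	{
		memcpy(dst, src, size);		/* cached copy */
		clean_cache_range(dst, size);	/* explicit write-back */
	}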
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-05 22:25 ` Dan Williams
@ 2017-05-05 22:44 ` Kani, Toshimitsu
2017-05-06 2:15 ` Dan Williams
0 siblings, 1 reply; 16+ messages in thread
From: Kani, Toshimitsu @ 2017-05-05 22:44 UTC (permalink / raw)
To: dan.j.williams@intel.com
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de,
viro@zeniv.linux.org.uk, x86@kernel.org, mawilcox@microsoft.com,
hpa@zytor.com, linux-nvdimm@lists.01.org, mingo@redhat.com,
linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
jack@suse.cz
On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
 :
> > > ---
> > > Changes since the initial RFC:
> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > > set_memory_wt(), etc. (Ingo)
> >
> > Sorry I should have said earlier, but I think the term "wt" is
> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> > semantics, not WT semantics.
>
> The non-temporal stores do, but memcpy_wt() is using a combination of
> non-temporal stores and explicit cache flushing.
>
> > How about using "nocache" as it's been
> > used in __copy_user_nocache()?
>
> The difference in my mind is that the "_nocache" suffix indicates
> opportunistic / optional cache pollution avoidance whereas "_wt"
> strictly arranges for caches not to contain dirty data upon
> completion of the routine. For example, non-temporal stores on older
> x86 cpus could potentially leave dirty data in the cache, so
> memcpy_wt on those cpus would need to use explicit cache flushing.

I see.  I agree that its behavior is different from the existing one
with "_nocache".  That said, I think "wt" or "write-through" generally
means that writes allocate cachelines and keep them clean by writing to
memory.  So, subsequent reads to the destination will hit the
cachelines.  This is not the case with this interface.

Thanks,
-Toshi
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-05 22:44 ` Kani, Toshimitsu
@ 2017-05-06 2:15 ` Dan Williams
2017-05-06 3:17 ` Kani, Toshimitsu
2017-05-06 9:46 ` Ingo Molnar
0 siblings, 2 replies; 16+ messages in thread
From: Dan Williams @ 2017-05-06 2:15 UTC (permalink / raw)
To: Kani, Toshimitsu
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de,
viro@zeniv.linux.org.uk, x86@kernel.org, mawilcox@microsoft.com,
hpa@zytor.com, linux-nvdimm@lists.01.org, mingo@redhat.com,
linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
jack@suse.cz
On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> wrote:
> :
>> > > ---
>> > > Changes since the initial RFC:
>> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> > > set_memory_wt(), etc. (Ingo)
>> >
>> > Sorry I should have said earlier, but I think the term "wt" is
>> > misleading. Non-temporal stores used in memcpy_wt() provide WC
>> > semantics, not WT semantics.
>>
>> The non-temporal stores do, but memcpy_wt() is using a combination of
>> non-temporal stores and explicit cache flushing.
>>
>> > How about using "nocache" as it's been
>> > used in __copy_user_nocache()?
>>
>> The difference in my mind is that the "_nocache" suffix indicates
>> opportunistic / optional cache pollution avoidance whereas "_wt"
>> strictly arranges for caches not to contain dirty data upon
>> completion of the routine. For example, non-temporal stores on older
>> x86 cpus could potentially leave dirty data in the cache, so
>> memcpy_wt on those cpus would need to use explicit cache flushing.
>
> I see. I agree that its behavior is different from the existing one
> with "_nocache". That said, I think "wt" or "write-through" generally
> means that writes allocate cachelines and keep them clean by writing to
> memory. So, subsequent reads to the destination will hit the
> cachelines. This is not the case with this interface.
True... maybe _nocache_strict()? Or, leave it _wt() until someone
comes along and is surprised that the cache is not warm for reads
after memcpy_wt(), at which point we can ask "why not just use plain
memcpy then?", or set the page-attributes to WT.
* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-06 2:15 ` Dan Williams
@ 2017-05-06 3:17 ` Kani, Toshimitsu
2017-05-06 9:46 ` Ingo Molnar
1 sibling, 0 replies; 16+ messages in thread
From: Kani, Toshimitsu @ 2017-05-06 3:17 UTC (permalink / raw)
To: Dan Williams
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de,
viro@zeniv.linux.org.uk, x86@kernel.org, mawilcox@microsoft.com,
hpa@zytor.com, linux-nvdimm@lists.01.org, mingo@redhat.com,
linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
jack@suse.cz
> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> wrote:
> >  :
> >> > > ---
> >> > > Changes since the initial RFC:
> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> > > set_memory_wt(), etc. (Ingo)
> >> >
> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> > misleading.  Non-temporal stores used in memcpy_wt() provide WC
> >> > semantics, not WT semantics.
> >>
> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> non-temporal stores and explicit cache flushing.
> >>
> >> > How about using "nocache" as it's been
> >> > used in __copy_user_nocache()?
> >>
> >> The difference in my mind is that the "_nocache" suffix indicates
> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> strictly arranges for caches not to contain dirty data upon
> >> completion of the routine. For example, non-temporal stores on older
> >> x86 cpus could potentially leave dirty data in the cache, so
> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >
> > I see.  I agree that its behavior is different from the existing one
> > with "_nocache".  That said, I think "wt" or "write-through" generally
> > means that writes allocate cachelines and keep them clean by writing to
> > memory.  So, subsequent reads to the destination will hit the
> > cachelines.  This is not the case with this interface.
>
> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.

I prefer _nocache_strict(), if it's not too long, since it avoids any
confusion.  If other arches actually implement it with WT semantics,
we might become the one to change it, instead of the caller.

Thanks,
-Toshi
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-06 2:15 ` Dan Williams
2017-05-06 3:17 ` Kani, Toshimitsu
@ 2017-05-06 9:46 ` Ingo Molnar
2017-05-06 13:57 ` Dan Williams
1 sibling, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2017-05-06 9:46 UTC (permalink / raw)
To: Dan Williams
Cc: Kani, Toshimitsu, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, jmoyer@redhat.com,
tglx@linutronix.de, hch@lst.de, viro@zeniv.linux.org.uk,
x86@kernel.org, mawilcox@microsoft.com, hpa@zytor.com,
linux-nvdimm@lists.01.org, mingo@redhat.com,
linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
jack@suse.cz
* Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> wrote:
> > :
> >> > > ---
> >> > > Changes since the initial RFC:
> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> > > set_memory_wt(), etc. (Ingo)
> >> >
> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> > misleading. Non-temporal stores used in memcpy_wt() provide WC
> >> > semantics, not WT semantics.
> >>
> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> non-temporal stores and explicit cache flushing.
> >>
> >> > How about using "nocache" as it's been
> >> > used in __copy_user_nocache()?
> >>
> >> The difference in my mind is that the "_nocache" suffix indicates
> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> strictly arranges for caches not to contain dirty data upon
> >> completion of the routine. For example, non-temporal stores on older
> >> x86 cpus could potentially leave dirty data in the cache, so
> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >
> > I see. I agree that its behavior is different from the existing one
> > with "_nocache". That said, I think "wt" or "write-through" generally
> > means that writes allocate cachelines and keep them clean by writing to
> > memory. So, subsequent reads to the destination will hit the
> > cachelines. This is not the case with this interface.
>
> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> comes along and is surprised that the cache is not warm for reads
> after memcpy_wt(), at which point we can ask "why not just use plain
> memcpy then?", or set the page-attributes to WT.
Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
no cache line is left around afterwards (dirty or clean)?
Thanks,
Ingo
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-06 9:46 ` Ingo Molnar
@ 2017-05-06 13:57 ` Dan Williams
2017-05-07 8:57 ` Ingo Molnar
0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-05-06 13:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kani, Toshimitsu, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, jmoyer@redhat.com,
tglx@linutronix.de, hch@lst.de, viro@zeniv.linux.org.uk,
x86@kernel.org, mawilcox@microsoft.com, hpa@zytor.com,
linux-nvdimm@lists.01.org, mingo@redhat.com,
linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
jack@suse.cz
On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
>> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> >> wrote:
>> > :
>> >> > > ---
>> >> > > Changes since the initial RFC:
>> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
>> >> > > set_memory_wt(), etc. (Ingo)
>> >> >
>> >> > Sorry I should have said earlier, but I think the term "wt" is
>> >> > misleading. Non-temporal stores used in memcpy_wt() provide WC
>> >> > semantics, not WT semantics.
>> >>
>> >> The non-temporal stores do, but memcpy_wt() is using a combination of
>> >> non-temporal stores and explicit cache flushing.
>> >>
>> >> > How about using "nocache" as it's been
>> >> > used in __copy_user_nocache()?
>> >>
>> >> The difference in my mind is that the "_nocache" suffix indicates
>> >> opportunistic / optional cache pollution avoidance whereas "_wt"
>> >> strictly arranges for caches not to contain dirty data upon
>> >> completion of the routine. For example, non-temporal stores on older
>> >> x86 cpus could potentially leave dirty data in the cache, so
>> >> memcpy_wt on those cpus would need to use explicit cache flushing.
>> >
>> > I see. I agree that its behavior is different from the existing one
>> > with "_nocache". That said, I think "wt" or "write-through" generally
>> > means that writes allocate cachelines and keep them clean by writing to
>> > memory. So, subsequent reads to the destination will hit the
>> > cachelines. This is not the case with this interface.
>>
>> True... maybe _nocache_strict()? Or, leave it _wt() until someone
>> comes along and is surprised that the cache is not warm for reads
>> after memcpy_wt(), at which point we can ask "why not just use plain
>> memcpy then?", or set the page-attributes to WT.
>
> Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> no cache line is left around afterwards (dirty or clean)?
Yes, I think "flush" belongs in the name, and to make it easily
grep-able separate from _nocache we can call it _flushcache? An
efficient implementation will use _nocache / non-temporal stores
internally, but external consumers just care about the state of the
cache after the call.
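As a usage sketch of that contract (an assumed consumer shape using the naming agreed here, not code from this series): the copy routine guarantees clean destination cachelines, so the driver's flush path only has to fence the posted non-temporal stores.

/* Assumed shape of a pmem consumer; illustrative only. */
static size_t pmem_copy_from_iter(void *pmem_addr, size_t bytes,
				  struct iov_iter *i)
{
	/* destination cachelines are guaranteed clean on return */
	return copy_from_iter_flushcache(pmem_addr, bytes, i);
}

static void pmem_flush(void)
{
	/* fsync() arrives as REQ_FLUSH/REQ_FUA; wmb() (sfence on x86)
	 * drains the store buffers so earlier movnt writes are durable */
	wmb();
}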
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-06 13:57 ` Dan Williams
@ 2017-05-07 8:57 ` Ingo Molnar
2017-05-08 3:01 ` Kani, Toshimitsu
0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2017-05-07 8:57 UTC (permalink / raw)
To: Dan Williams
Cc: Kani, Toshimitsu, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, jmoyer@redhat.com,
tglx@linutronix.de, hch@lst.de, viro@zeniv.linux.org.uk,
x86@kernel.org, mawilcox@microsoft.com, hpa@zytor.com,
linux-nvdimm@lists.01.org, mingo@redhat.com,
linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
jack@suse.cz
* Dan Williams <dan.j.williams@intel.com> wrote:
> On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >> >> wrote:
> >> > :
> >> >> > > ---
> >> >> > > Changes since the initial RFC:
> >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> >> >> > > set_memory_wt(), etc. (Ingo)
> >> >> >
> >> >> > Sorry I should have said earlier, but I think the term "wt" is
> >> >> > misleading. Non-temporal stores used in memcpy_wt() provide WC
> >> >> > semantics, not WT semantics.
> >> >>
> >> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> >> >> non-temporal stores and explicit cache flushing.
> >> >>
> >> >> > How about using "nocache" as it's been
> >> >> > used in __copy_user_nocache()?
> >> >>
> >> >> The difference in my mind is that the "_nocache" suffix indicates
> >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> >> >> strictly arranges for caches not to contain dirty data upon
> >> >> completion of the routine. For example, non-temporal stores on older
> >> >> x86 cpus could potentially leave dirty data in the cache, so
> >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> >> >
> >> > I see. I agree that its behavior is different from the existing one
> >> > with "_nocache". That said, I think "wt" or "write-through" generally
> >> > means that writes allocate cachelines and keep them clean by writing to
> >> > memory. So, subsequent reads to the destination will hit the
> >> > cachelines. This is not the case with this interface.
> >>
> >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> >> comes along and is surprised that the cache is not warm for reads
> >> after memcpy_wt(), at which point we can ask "why not just use plain
> >> memcpy then?", or set the page-attributes to WT.
> >
> > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> > no cache line is left around afterwards (dirty or clean)?
>
> Yes, I think "flush" belongs in the name, and to make it easily
> grep-able separate from _nocache we can call it _flushcache? An
> efficient implementation will use _nocache / non-temporal stores
> internally, but external consumers just care about the state of the
> cache after the call.
_flushcache() works for me too.
Thanks,
Ingo
* RE: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-07 8:57 ` Ingo Molnar
@ 2017-05-08 3:01 ` Kani, Toshimitsu
0 siblings, 0 replies; 16+ messages in thread
From: Kani, Toshimitsu @ 2017-05-08 3:01 UTC (permalink / raw)
To: Ingo Molnar, Dan Williams
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
jmoyer@redhat.com, tglx@linutronix.de, hch@lst.de,
viro@zeniv.linux.org.uk, x86@kernel.org, mawilcox@microsoft.com,
hpa@zytor.com, linux-nvdimm@lists.01.org, mingo@redhat.com,
linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com,
jack@suse.cz
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
> > On Sat, May 6, 2017 at 2:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > * Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > >> On Fri, May 5, 2017 at 3:44 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> > >> > On Fri, 2017-05-05 at 15:25 -0700, Dan Williams wrote:
> > >> >> On Fri, May 5, 2017 at 1:39 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> > >> >> wrote:
> > >> > :
> > >> >> > > ---
> > >> >> > > Changes since the initial RFC:
> > >> >> > > * s/writethru/wt/ since we already have ioremap_wt(),
> > >> >> > > set_memory_wt(), etc. (Ingo)
> > >> >> >
> > >> >> > Sorry I should have said earlier, but I think the term "wt" is
> > >> >> > misleading. Non-temporal stores used in memcpy_wt() provide WC
> > >> >> > semantics, not WT semantics.
> > >> >>
> > >> >> The non-temporal stores do, but memcpy_wt() is using a combination of
> > >> >> non-temporal stores and explicit cache flushing.
> > >> >>
> > >> >> > How about using "nocache" as it's been
> > >> >> > used in __copy_user_nocache()?
> > >> >>
> > >> >> The difference in my mind is that the "_nocache" suffix indicates
> > >> >> opportunistic / optional cache pollution avoidance whereas "_wt"
> > >> >> strictly arranges for caches not to contain dirty data upon
> > >> >> completion of the routine. For example, non-temporal stores on older
> > >> >> x86 cpus could potentially leave dirty data in the cache, so
> > >> >> memcpy_wt on those cpus would need to use explicit cache flushing.
> > >> >
> > >> > I see. I agree that its behavior is different from the existing one
> > >> > with "_nocache". That said, I think "wt" or "write-through" generally
> > >> > means that writes allocate cachelines and keep them clean by writing to
> > >> > memory. So, subsequent reads to the destination will hit the
> > >> > cachelines. This is not the case with this interface.
> > >>
> > >> True... maybe _nocache_strict()? Or, leave it _wt() until someone
> > >> comes along and is surprised that the cache is not warm for reads
> > >> after memcpy_wt(), at which point we can ask "why not just use plain
> > >> memcpy then?", or set the page-attributes to WT.
> > >
> > > Perhaps a _nocache_flush() postfix, to signal both that it's non-temporal and that
> > > no cache line is left around afterwards (dirty or clean)?
> >
> > Yes, I think "flush" belongs in the name, and to make it easily
> > grep-able separate from _nocache we can call it _flushcache? An
> > efficient implementation will use _nocache / non-temporal stores
> > internally, but external consumers just care about the state of the
> > cache after the call.
>
> _flushcache() works for me too.
>
Works for me too.
Thanks,
-Toshi
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-04-28 19:39 ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
2017-05-05 6:54 ` Ingo Molnar
2017-05-05 20:39 ` Kani, Toshimitsu
@ 2017-05-08 20:32 ` Ross Zwisler
2017-05-08 20:40 ` Dan Williams
2 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2017-05-08 20:32 UTC (permalink / raw)
To: Dan Williams
Cc: viro, Jan Kara, Matthew Wilcox, x86, linux-kernel, hch,
linux-block, linux-nvdimm, jmoyer, Ingo Molnar, H. Peter Anvin,
linux-fsdevel, Thomas Gleixner, ross.zwisler
On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
> The pmem driver has a need to transfer data with a persistent memory
> destination and be able to rely on the fact that the destination writes
> are not cached. It is sufficient for the writes to be flushed to a
> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
> userspace to call fsync() to ensure data-writes have reached a
> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
> REQ_FLUSH to the pmem driver which will turn around and fence previous
> writes with an "sfence".
>
> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
> that guarantee that the destination buffer is not dirty in the cpu cache
> on completion. The new copy_from_iter_wt and sub-routines will be used
> to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.
>
> This is meant to satisfy the concern from Linus that if a driver wants
> to do something beyond the normal nocache semantics it should be
> something private to that driver [1], and Al's concern that anything
> uaccess related belongs with the rest of the uaccess code [2].
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>
> Cc: <x86@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
<>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index c5504b9a472e..07ded30c7e89 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
> extern long __copy_user_nocache(void *dst, const void __user *src,
> unsigned size, int zerorest);
>
> +extern long __copy_user_wt(void *dst, const void __user *src, unsigned size);
> +extern void memcpy_page_wt(char *to, struct page *page, size_t offset,
> + size_t len);
> +
> static inline int
> __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
> unsigned size)
> @@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
> return __copy_user_nocache(dst, src, size, 0);
> }
>
> +static inline int
> +__copy_from_user_inatomic_wt(void *dst, const void __user *src, unsigned size)
> +{
> + kasan_check_write(dst, size);
> + return __copy_user_wt(dst, src, size);
> +}
> +
> unsigned long
> copy_user_handle_tail(char *to, char *from, unsigned len);
>
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 3b7c40a2e3e1..0aeff66a022f 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -7,6 +7,7 @@
> */
> #include <linux/export.h>
> #include <linux/uaccess.h>
> +#include <linux/highmem.h>
>
> /*
> * Zero Userspace
> @@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
> clac();
> return len;
> }
> +
> +#ifdef CONFIG_ARCH_HAS_UACCESS_WT
> +/**
> + * clean_cache_range - write back a cache range with CLWB
> + * @vaddr: virtual start address
> + * @size: number of bytes to write back
> + *
> + * Write back a cache range using the CLWB (cache line write back)
> + * instruction. Note that @size is internally rounded up to be cache
> + * line size aligned.
> + */
> +static void clean_cache_range(void *addr, size_t size)
> +{
> + u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
> + unsigned long clflush_mask = x86_clflush_size - 1;
> + void *vend = addr + size;
> + void *p;
> +
> + for (p = (void *)((unsigned long)addr & ~clflush_mask);
> + p < vend; p += x86_clflush_size)
> + clwb(p);
> +}
> +
> +long __copy_user_wt(void *dst, const void __user *src, unsigned size)
> +{
> + unsigned long flushed, dest = (unsigned long) dst;
> + long rc = __copy_user_nocache(dst, src, size, 0);
> +
> + /*
> + * __copy_user_nocache() uses non-temporal stores for the bulk
> + * of the transfer, but we need to manually flush if the
> + * transfer is unaligned. A cached memory copy is used when
> + * destination or size is not naturally aligned. That is:
> + * - Require 8-byte alignment when size is 8 bytes or larger.
> + * - Require 4-byte alignment when size is 4 bytes.
> + */
> + if (size < 8) {
> + if (!IS_ALIGNED(dest, 4) || size != 4)
> + clean_cache_range(dst, 1);
> + } else {
> + if (!IS_ALIGNED(dest, 8)) {
> + dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> + clean_cache_range(dst, 1);
> + }
> +
> + flushed = dest - (unsigned long) dst;
> + if (size > flushed && !IS_ALIGNED(size - flushed, 8))
> + clean_cache_range(dst + size - 1, 1);
> + }
> +
> + return rc;
> +}
> +
> +void memcpy_wt(void *_dst, const void *_src, size_t size)
> +{
> + unsigned long dest = (unsigned long) _dst;
> + unsigned long source = (unsigned long) _src;
> +
> + /* cache copy and flush to align dest */
> + if (!IS_ALIGNED(dest, 8)) {
> + unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
> +
> + memcpy((void *) dest, (void *) source, len);
> + clean_cache_range((void *) dest, len);
> + dest += len;
> + source += len;
> + size -= len;
> + if (!size)
> + return;
> + }
> +
> + /* 4x8 movnti loop */
> + while (size >= 32) {
> + asm("movq (%0), %%r8\n"
> + "movq 8(%0), %%r9\n"
> + "movq 16(%0), %%r10\n"
> + "movq 24(%0), %%r11\n"
> + "movnti %%r8, (%1)\n"
> + "movnti %%r9, 8(%1)\n"
> + "movnti %%r10, 16(%1)\n"
> + "movnti %%r11, 24(%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8", "r9", "r10", "r11");
> + dest += 32;
> + source += 32;
> + size -= 32;
> + }
> +
> + /* 1x8 movnti loop */
> + while (size >= 8) {
> + asm("movq (%0), %%r8\n"
> + "movnti %%r8, (%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8");
> + dest += 8;
> + source += 8;
> + size -= 8;
> + }
> +
> + /* 1x4 movnti loop */
> + while (size >= 4) {
> + asm("movl (%0), %%r8d\n"
> + "movnti %%r8d, (%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8");
> + dest += 4;
> + source += 4;
> + size -= 4;
> + }
> +
> + /* cache copy for remaining bytes */
> + if (size) {
> + memcpy((void *) dest, (void *) source, size);
> + clean_cache_range((void *) dest, size);
> + }
> +}
> +EXPORT_SYMBOL_GPL(memcpy_wt);
I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
they look correct to me. The inline assembly for non-temporal copies mixed
with C for loop control is IMHO much easier to follow than the pure assembly
of __copy_user_nocache().
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
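For anyone re-checking the same edge cases, below is a rough user-space model of the head/tail flush decisions in the quoted __copy_user_wt(); the harness, the 64-byte constant, and the sample addresses are assumptions, and only the branch logic mirrors the patch:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define CLFLUSH_SIZE	64UL	/* assumed boot_cpu_data.x86_clflush_size */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))

/* mirror the patch's branches: which edges need an explicit flush? */
static void report(uint64_t dest, size_t size)
{
	uint64_t start = dest, flushed;
	bool head = false, tail = false;

	if (size < 8) {
		/* short copies go through the cache unless they are
		 * exactly 4 bytes at a 4-byte-aligned destination */
		if (!IS_ALIGNED(dest, 4) || size != 4)
			head = true;
	} else {
		if (!IS_ALIGNED(dest, 8)) {
			dest = ALIGN(dest, CLFLUSH_SIZE);
			head = true;	/* cached head up to the next line */
		}
		flushed = dest - start;
		if (size > flushed && !IS_ALIGNED(size - flushed, 8))
			tail = true;	/* ragged tail was a cached copy */
	}
	printf("dest=%#lx size=%zu: head flush=%d tail flush=%d\n",
	       (unsigned long)start, size, head, tail);
}

int main(void)
{
	report(0x1000, 256);	/* aligned dest and size: no explicit flush */
	report(0x1003, 256);	/* unaligned head, tail no longer 8-aligned */
	report(0x1000, 5);	/* short copy, not the 4-at-4 special case */
	return 0;
}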
* Re: [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations
2017-05-08 20:32 ` Ross Zwisler
@ 2017-05-08 20:40 ` Dan Williams
0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2017-05-08 20:40 UTC (permalink / raw)
To: Ross Zwisler, Dan Williams, Al Viro, Jan Kara, Matthew Wilcox,
X86 ML, linux-kernel@vger.kernel.org, Christoph Hellwig,
linux-block, linux-nvdimm@lists.01.org, jmoyer, Ingo Molnar,
H. Peter Anvin, linux-fsdevel, Thomas Gleixner
On Mon, May 8, 2017 at 1:32 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Apr 28, 2017 at 12:39:12PM -0700, Dan Williams wrote:
>> The pmem driver has a need to transfer data with a persistent memory
>> destination and be able to rely on the fact that the destination writes
>> are not cached. It is sufficient for the writes to be flushed to a
>> cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect
>> userspace to call fsync() to ensure data-writes have reached a
>> power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or
>> REQ_FLUSH to the pmem driver which will turn around and fence previous
>> writes with an "sfence".
>>
>> Implement a __copy_from_user_inatomic_wt, memcpy_page_wt, and memcpy_wt,
>> that guarantee that the destination buffer is not dirty in the cpu cache
>> on completion. The new copy_from_iter_wt and sub-routines will be used
>> to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_wt()
>> and memcpy_wt() are gated by the CONFIG_ARCH_HAS_UACCESS_WT config
>> symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>>
>> This is meant to satisfy the concern from Linus that if a driver wants
>> to do something beyond the normal nocache semantics it should be
>> something private to that driver [1], and Al's concern that anything
>> uaccess related belongs with the rest of the uaccess code [2].
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
>>
>> Cc: <x86@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[..]
> I took a pretty hard look at the changes in arch/x86/lib/usercopy_64.c, and
> they look correct to me. The inline assembly for non-temporal copies mixed
> with C for loop control is IMHO much easier to follow than the pure assembly
> of __copy_user_nocache().
>
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Thanks Ross, I appreciate it.
end of thread
Thread overview: 16+ messages
[not found] <20170425012230.GX29622@ZenIV.linux.org.uk>
2017-04-26 21:56 ` [RFC PATCH] x86, uaccess, pmem: introduce copy_from_iter_writethru for dax + pmem Dan Williams
2017-04-27 6:30 ` Ingo Molnar
2017-04-28 19:39 ` [PATCH v2] x86, uaccess: introduce copy_from_iter_wt for pmem / writethrough operations Dan Williams
2017-05-05 6:54 ` Ingo Molnar
2017-05-05 14:12 ` Dan Williams
2017-05-05 20:39 ` Kani, Toshimitsu
2017-05-05 22:25 ` Dan Williams
2017-05-05 22:44 ` Kani, Toshimitsu
2017-05-06 2:15 ` Dan Williams
2017-05-06 3:17 ` Kani, Toshimitsu
2017-05-06 9:46 ` Ingo Molnar
2017-05-06 13:57 ` Dan Williams
2017-05-07 8:57 ` Ingo Molnar
2017-05-08 3:01 ` Kani, Toshimitsu
2017-05-08 20:32 ` Ross Zwisler
2017-05-08 20:40 ` Dan Williams