* [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush
@ 2017-10-06 22:35 Dan Williams
  2017-10-06 22:35 ` [PATCH v7 03/12] fs: introduce i_mapdcount Dan Williams
  ` (7 more replies)
  0 siblings, 8 replies; 49+ messages in thread

From: Dan Williams @ 2017-10-06 22:35 UTC
To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw
Cc: Jan Kara, Dave Chinner, J. Bruce Fields, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    Sean Hefty, Jeff Layton, Marek Szyprowski, Ashok Raj, Darrick J. Wong,
    linux-rdma-u79uwXL29TY76Z2rM5mHXA, Joerg Roedel, Doug Ledford,
    Christoph Hellwig, Linus Torvalds, Jeff Moyer, Ross Zwisler,
    Hal Rosenstock, Arnd Bergmann, Robin Murphy, Alexander Viro,
    Andy Lutomirski

Changes since v6 [1]:

* Abandon the concept of immutable files and rework the implementation to
  reuse the same FL_LAYOUT file lease mechanism that coordinates pnfs
  layouts vs local filesystem changes. This establishes an interface where
  the kernel is always in control of the block-map and is free to
  invalidate MAP_DIRECT mappings when a lease breaker arrives. (Christoph)

* Introduce a new ->mmap_validate() file operation, since we need both the
  original @flags and the @fd passed to mmap(2) to set up a MAP_DIRECT
  mapping.

* Introduce a ->lease_direct() vm operation to allow the RDMA core to
  safely register memory against DAX and tear down the mapping when the
  lease is broken. This can be reused by any sub-system that follows a
  memory registration semantic.

[1]: https://lkml.org/lkml/2017/8/23/754

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block-map, or otherwise
dirty the block-map metadata of a file, without notification. It supports
a "flush from userspace" model where persistent memory applications can
bypass the overhead of ongoing coordination of writes with the
filesystem, and it provides safety to RDMA operations involving DAX
mappings.
The kernel always has the ability to revoke access and convert the file
back to normal operation after performing a "lease break". Similar to
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started; it can only be delayed via the
/proc/sys/fs/lease-break-time setting. MAP_DIRECT enables XFS to supplant
the device-dax interface for mmap-write access to persistent memory with
no ongoing coordination with the filesystem via fsync/msync syscalls.

---

Dan Williams (12):
      mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
      fs, mm: pass fd to ->mmap_validate()
      fs: introduce i_mapdcount
      fs: MAP_DIRECT core
      xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
      xfs: wire up MAP_DIRECT
      dma-mapping: introduce dma_has_iommu()
      fs, mapdirect: introduce ->lease_direct()
      xfs: wire up ->lease_direct()
      device-dax: wire up ->lease_direct()
      IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
      tools/testing/nvdimm: enable rdma unit tests

 arch/alpha/include/uapi/asm/mman.h           |    1
 arch/mips/include/uapi/asm/mman.h            |    1
 arch/mips/kernel/vdso.c                      |    2
 arch/parisc/include/uapi/asm/mman.h          |    1
 arch/tile/mm/elf.c                           |    3
 arch/x86/mm/mpx.c                            |    3
 arch/xtensa/include/uapi/asm/mman.h          |    1
 drivers/base/dma-mapping.c                   |   10 +
 drivers/dax/device.c                         |    4
 drivers/infiniband/core/umem.c               |   90 ++++++-
 drivers/iommu/amd_iommu.c                    |    6
 drivers/iommu/intel-iommu.c                  |    6
 fs/Kconfig                                   |    4
 fs/Makefile                                  |    1
 fs/aio.c                                     |    2
 fs/mapdirect.c                               |  349 ++++++++++++++++++++++++++
 fs/xfs/Kconfig                               |    4
 fs/xfs/Makefile                              |    1
 fs/xfs/xfs_file.c                            |  130 ++++++++++
 fs/xfs/xfs_iomap.c                           |    9 +
 fs/xfs/xfs_layout.c                          |   42 +++
 fs/xfs/xfs_layout.h                          |   13 +
 fs/xfs/xfs_pnfs.c                            |   30 --
 fs/xfs/xfs_pnfs.h                            |   10 -
 include/linux/dma-mapping.h                  |    3
 include/linux/fs.h                           |   33 ++
 include/linux/mapdirect.h                    |   68 +++++
 include/linux/mm.h                           |   15 +
 include/linux/mman.h                         |   42 +++
 include/rdma/ib_umem.h                       |    8 +
 include/uapi/asm-generic/mman-common.h       |    1
 include/uapi/asm-generic/mman.h              |    1
 ipc/shm.c                                    |    3
 mm/internal.h                                |    2
 mm/mmap.c                                    |   28 ++
 mm/nommu.c                                   |    5
 mm/util.c                                    |    7 -
 tools/include/uapi/asm-generic/mman-common.h |    1
 tools/testing/nvdimm/Kbuild                  |   31 ++
 tools/testing/nvdimm/config_check.c          |    2
 tools/testing/nvdimm/test/iomap.c            |    6
 41 files changed, 906 insertions(+), 73 deletions(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h
 create mode 100644 include/linux/mapdirect.h

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* [PATCH v7 03/12] fs: introduce i_mapdcount
  2017-10-06 22:35 [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush Dan Williams
@ 2017-10-06 22:35 ` Dan Williams
  [not found] ` <150732933283.22363.570426117546397495.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2017-10-06 22:35 ` [PATCH v7 05/12] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT Dan Williams
  ` (6 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread

From: Dan Williams @ 2017-10-06 22:35 UTC
To: linux-nvdimm
Cc: linux-xfs, Jan Kara, Darrick J. Wong, linux-rdma, linux-api,
    Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
    Jeff Moyer, linux-fsdevel, Jeff Layton, Ross Zwisler

When ->iomap_begin() sees this count being non-zero and determines that
the block map of the file needs to be modified to satisfy the I/O
request it will instead return an error. This is needed for MAP_DIRECT
where, due to locking constraints, we can't rely on xfs_break_layouts()
to protect against allocating write-faults either from the process that
setup the MAP_DIRECT mapping nor other processes that have the file
mapped. xfs_break_layouts() requires XFS_IOLOCK which is problematic to
mix with the XFS_MMAPLOCK in the fault path.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_iomap.c |    9 +++++++++
 include/linux/fs.h |   31 +++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index a1909bc064e9..6816f8ebbdcf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1053,6 +1053,15 @@ xfs_file_iomap_begin(
 		goto out_unlock;
 	}
 	/*
+	 * If a file has MAP_DIRECT mappings disable block map
+	 * updates. This should only effect mmap write faults as
+	 * other paths are protected by an FL_LAYOUT lease.
+	 */
+	if (i_mapdcount_read(inode)) {
+		error = -ETXTBSY;
+		goto out_unlock;
+	}
+	/*
 	 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 	 * pages to keep the chunks of work done where somewhat symmetric
 	 * with the work writeback does. This is a completely arbitrary
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c2b9bf3dc4e9..f83871b188ff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -642,6 +642,9 @@ struct inode {
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
 	atomic_t		i_writecount;
+#ifdef CONFIG_FS_DAX
+	atomic_t		i_mapdcount;	/* count of MAP_DIRECT vmas */
+#endif
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount;	/* struct files open RO */
 #endif
@@ -2784,6 +2787,34 @@ static inline void i_readcount_inc(struct inode *inode)
 	return;
 }
 #endif
+
+#ifdef CONFIG_FS_DAX
+static inline void i_mapdcount_dec(struct inode *inode)
+{
+	BUG_ON(!atomic_read(&inode->i_mapdcount));
+	atomic_dec(&inode->i_mapdcount);
+}
+static inline void i_mapdcount_inc(struct inode *inode)
+{
+	atomic_inc(&inode->i_mapdcount);
+}
+static inline int i_mapdcount_read(struct inode *inode)
+{
+	return atomic_read(&inode->i_mapdcount);
+}
+#else
+static inline void i_mapdcount_dec(struct inode *inode)
+{
+}
+static inline void i_mapdcount_inc(struct inode *inode)
+{
+}
+static inline int i_mapdcount_read(struct inode *inode)
+{
+	return 0;
+}
+#endif
+
 extern int do_pipe_flags(int *, int);
 #define __kernel_read_file_id(id) \

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH v7 03/12] fs: introduce i_mapdcount [not found] ` <150732933283.22363.570426117546397495.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-09 3:08 ` Dave Chinner 0 siblings, 0 replies; 49+ messages in thread From: Dave Chinner @ 2017-10-09 3:08 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Jan Kara, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, J. Bruce Fields, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jeff Layton, Ross Zwisler On Fri, Oct 06, 2017 at 03:35:32PM -0700, Dan Williams wrote: > When ->iomap_begin() sees this count being non-zero and determines that > the block map of the file needs to be modified to satisfy the I/O > request it will instead return an error. This is needed for MAP_DIRECT > where, due to locking constraints, we can't rely on xfs_break_layouts() > to protect against allocating write-faults either from the process that > setup the MAP_DIRECT mapping nor other processes that have the file > mapped. xfs_break_layouts() requires XFS_IOLOCK which is problematic to > mix with the XFS_MMAPLOCK in the fault path. > > Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> > Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> > Cc: "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > Cc: Jeff Layton <jlayton-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org> > Cc: "J. 
Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> > Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > --- > fs/xfs/xfs_iomap.c | 9 +++++++++ > include/linux/fs.h | 31 +++++++++++++++++++++++++++++++ > 2 files changed, 40 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index a1909bc064e9..6816f8ebbdcf 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -1053,6 +1053,15 @@ xfs_file_iomap_begin( > goto out_unlock; > } > /* > + * If a file has MAP_DIRECT mappings disable block map > + * updates. This should only effect mmap write faults as > + * other paths are protected by an FL_LAYOUT lease. > + */ > + if (i_mapdcount_read(inode)) { > + error = -ETXTBSY; > + goto out_unlock; > + } That looks really fragile. For one, it's going to miss modifications to reflinked files altogether. Ignoring that, however, I don't want to have to care one bit about the internals of the MAP_DIRECT implementation in the filesystem code. Hide it behind something with an obvious name that returns the appropriate error and the filesystem code becomes self documenting: if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) { ..... error = iomap_can_allocate(inode); if (error) goto out_unlock; Then you can put all the MAP_DIRECT stuff and the comments explaining what is does inside the generic function that determines if we are allowed to allocate on that inode or not. > + /* > * We cap the maximum length we map here to MAX_WRITEBACK_PAGES > * pages to keep the chunks of work done where somewhat symmetric > * with the work writeback does. 
This is a completely arbitrary > diff --git a/include/linux/fs.h b/include/linux/fs.h > index c2b9bf3dc4e9..f83871b188ff 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -642,6 +642,9 @@ struct inode { > atomic_t i_count; > atomic_t i_dio_count; > atomic_t i_writecount; > +#ifdef CONFIG_FS_DAX > + atomic_t i_mapdcount; /* count of MAP_DIRECT vmas */ > +#endif Is there any way to avoid growing the struct inode for this? Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v7 05/12] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT 2017-10-06 22:35 [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush Dan Williams 2017-10-06 22:35 ` [PATCH v7 03/12] fs: introduce i_mapdcount Dan Williams @ 2017-10-06 22:35 ` Dan Williams 2017-10-06 22:35 ` [PATCH v7 06/12] xfs: wire up MAP_DIRECT Dan Williams ` (5 subsequent siblings) 7 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw) To: linux-nvdimm Cc: Jan Kara, Darrick J. Wong, linux-rdma, linux-api, Dave Chinner, linux-xfs, linux-mm, Jeff Moyer, linux-fsdevel, Ross Zwisler, Christoph Hellwig Move xfs_break_layouts() to its own compilation unit so that it can be used for both pnfs layouts and MAP_DIRECT mappings. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/xfs/Kconfig | 4 ++++ fs/xfs/Makefile | 1 + fs/xfs/xfs_layout.c | 42 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_layout.h | 13 +++++++++++++ fs/xfs/xfs_pnfs.c | 30 ------------------------------ fs/xfs/xfs_pnfs.h | 10 ++-------- 6 files changed, 62 insertions(+), 38 deletions(-) create mode 100644 fs/xfs/xfs_layout.c create mode 100644 fs/xfs/xfs_layout.h diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index 1b98cfa342ab..f62fc6629abb 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL result in warnings. This behavior can be modified at runtime via sysfs. 
+ +config XFS_LAYOUT + def_bool y + depends on EXPORTFS_BLOCK_OPS diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index a6e955bfead8..d44135107490 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL) += xfs_acl.o xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o +xfs-$(CONFIG_XFS_LAYOUT) += xfs_layout.o diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c new file mode 100644 index 000000000000..71d95e1a910a --- /dev/null +++ b/fs/xfs/xfs_layout.c @@ -0,0 +1,42 @@ +/* + * Copyright (c) 2014 Christoph Hellwig. + */ +#include "xfs.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_sb.h" +#include "xfs_mount.h" +#include "xfs_inode.h" + +#include <linux/fs.h> + +/* + * Ensure that we do not have any outstanding pNFS layouts that can be used by + * clients to directly read from or write to this inode. This must be called + * before every operation that can remove blocks from the extent map. + * Additionally we call it during the write operation, where aren't concerned + * about exposing unallocated blocks but just want to provide basic + * synchronization between a local writer and pNFS clients. mmap writes would + * also benefit from this sort of synchronization, but due to the tricky locking + * rules in the page fault path we don't bother. 
+ */ +int +xfs_break_layouts( + struct inode *inode, + uint *iolock) +{ + struct xfs_inode *ip = XFS_I(inode); + int error; + + ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)); + + while ((error = break_layout(inode, false) == -EWOULDBLOCK)) { + xfs_iunlock(ip, *iolock); + error = break_layout(inode, true); + *iolock = XFS_IOLOCK_EXCL; + xfs_ilock(ip, *iolock); + } + + return error; +} diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h new file mode 100644 index 000000000000..f848ee78cc93 --- /dev/null +++ b/fs/xfs/xfs_layout.h @@ -0,0 +1,13 @@ +#ifndef _XFS_LAYOUT_H +#define _XFS_LAYOUT_H 1 + +#ifdef CONFIG_XFS_LAYOUT +int xfs_break_layouts(struct inode *inode, uint *iolock); +#else +static inline int +xfs_break_layouts(struct inode *inode, uint *iolock) +{ + return 0; +} +#endif /* CONFIG_XFS_LAYOUT */ +#endif /* _XFS_LAYOUT_H */ diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c index 2f2dc3c09ad0..8ec72220e73b 100644 --- a/fs/xfs/xfs_pnfs.c +++ b/fs/xfs/xfs_pnfs.c @@ -20,36 +20,6 @@ #include "xfs_pnfs.h" /* - * Ensure that we do not have any outstanding pNFS layouts that can be used by - * clients to directly read from or write to this inode. This must be called - * before every operation that can remove blocks from the extent map. - * Additionally we call it during the write operation, where aren't concerned - * about exposing unallocated blocks but just want to provide basic - * synchronization between a local writer and pNFS clients. mmap writes would - * also benefit from this sort of synchronization, but due to the tricky locking - * rules in the page fault path we don't bother. 
- */ -int -xfs_break_layouts( - struct inode *inode, - uint *iolock) -{ - struct xfs_inode *ip = XFS_I(inode); - int error; - - ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)); - - while ((error = break_layout(inode, false) == -EWOULDBLOCK)) { - xfs_iunlock(ip, *iolock); - error = break_layout(inode, true); - *iolock = XFS_IOLOCK_EXCL; - xfs_ilock(ip, *iolock); - } - - return error; -} - -/* * Get a unique ID including its location so that the client can identify * the exported device. */ diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h index b587cb99b2b7..4135b2482697 100644 --- a/fs/xfs/xfs_pnfs.h +++ b/fs/xfs/xfs_pnfs.h @@ -1,19 +1,13 @@ #ifndef _XFS_PNFS_H #define _XFS_PNFS_H 1 +#include "xfs_layout.h" + #ifdef CONFIG_EXPORTFS_BLOCK_OPS int xfs_fs_get_uuid(struct super_block *sb, u8 *buf, u32 *len, u64 *offset); int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length, struct iomap *iomap, bool write, u32 *device_generation); int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps, struct iattr *iattr); - -int xfs_break_layouts(struct inode *inode, uint *iolock); -#else -static inline int -xfs_break_layouts(struct inode *inode, uint *iolock) -{ - return 0; -} #endif /* CONFIG_EXPORTFS_BLOCK_OPS */ #endif /* _XFS_PNFS_H */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v7 06/12] xfs: wire up MAP_DIRECT 2017-10-06 22:35 [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush Dan Williams 2017-10-06 22:35 ` [PATCH v7 03/12] fs: introduce i_mapdcount Dan Williams 2017-10-06 22:35 ` [PATCH v7 05/12] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT Dan Williams @ 2017-10-06 22:35 ` Dan Williams 2017-10-09 3:40 ` Dave Chinner 2017-10-06 22:35 ` [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() Dan Williams ` (4 subsequent siblings) 7 siblings, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw) To: linux-nvdimm Cc: linux-xfs, Jan Kara, Arnd Bergmann, Darrick J. Wong, linux-rdma, linux-api, Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm, Jeff Moyer, Alexander Viro, linux-fsdevel, Jeff Layton, Ross Zwisler MAP_DIRECT is an mmap(2) flag with the following semantics: MAP_DIRECT When specified with MAP_SHARED_VALIDATE, sets up a file lease with the same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease is broken when a "lease breaker" attempts to write(2), change the block map (fallocate), or change the size of the file. Otherwise the mechanism of a lease break is identical to the typical lease break case where the lease needs to be removed (munmap) within the number of seconds specified by /proc/sys/fs/lease-break-time. If the lease holder fails to remove the lease in time the kernel will invalidate the mapping and force all future accesses to the mapping to trigger SIGBUS. 
In addition to lease break timeouts causing faults in the mapping to result in SIGBUS, other states of the file will trigger SIGBUS at fault time: * The file is not DAX capable * The file has reflinked (copy-on-write) blocks * The fault would trigger the filesystem to allocate blocks * The fault would trigger the filesystem to perform extent conversion In other words, MAP_DIRECT expects and enforces a fully allocated file where faults can be satisfied without modifying block map metadata. An unprivileged process may establish a MAP_DIRECT mapping on a file whose UID (owner) matches the filesystem UID of the process. A process with the CAP_LEASE capability may establish a MAP_DIRECT mapping on arbitrary files ERRORS EACCES Beyond the typical mmap(2) conditions that trigger EACCES MAP_DIRECT also requires the permission to set a file lease. EOPNOTSUPP The filesystem explicitly does not support the flag SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that might require block-map updates, or the lease timed out and the kernel invalidated the mapping. Cc: Jan Kara <jack@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: "J. 
Bruce Fields" <bfields@fieldses.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/xfs/Kconfig | 2 - fs/xfs/xfs_file.c | 102 +++++++++++++++++++++++++++++++++++++++ include/linux/mman.h | 3 + include/uapi/asm-generic/mman.h | 1 4 files changed, 106 insertions(+), 2 deletions(-) diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index f62fc6629abb..f8765653a438 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL config XFS_LAYOUT def_bool y - depends on EXPORTFS_BLOCK_OPS + depends on EXPORTFS_BLOCK_OPS || FS_DAX diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index ebdd0bd2b261..e35518600e28 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -40,12 +40,22 @@ #include "xfs_iomap.h" #include "xfs_reflink.h" +#include <linux/mman.h> #include <linux/dcache.h> #include <linux/falloc.h> #include <linux/pagevec.h> +#include <linux/mapdirect.h> #include <linux/backing-dev.h> static const struct vm_operations_struct xfs_file_vm_ops; +static const struct vm_operations_struct xfs_file_vm_direct_ops; + +static inline bool +is_xfs_map_direct( + struct vm_area_struct *vma) +{ + return vma->vm_ops == &xfs_file_vm_direct_ops; +} /* * Clear the specified ranges to zero through either the pagecache or DAX. @@ -1008,6 +1018,26 @@ xfs_file_llseek( return vfs_setpos(file, offset, inode->i_sb->s_maxbytes); } +static int +xfs_vma_checks( + struct vm_area_struct *vma, + struct inode *inode) +{ + if (!is_xfs_map_direct(vma)) + return 0; + + if (!is_map_direct_valid(vma->vm_private_data)) + return VM_FAULT_SIGBUS; + + if (xfs_is_reflink_inode(XFS_I(inode))) + return VM_FAULT_SIGBUS; + + if (!IS_DAX(inode)) + return VM_FAULT_SIGBUS; + + return 0; +} + /* * Locking for serialisation of IO during page faults. 
This results in a lock * ordering of: @@ -1024,6 +1054,7 @@ __xfs_filemap_fault( enum page_entry_size pe_size, bool write_fault) { + struct vm_area_struct *vma = vmf->vma; struct inode *inode = file_inode(vmf->vma->vm_file); struct xfs_inode *ip = XFS_I(inode); int ret; @@ -1032,10 +1063,14 @@ __xfs_filemap_fault( if (write_fault) { sb_start_pagefault(inode->i_sb); - file_update_time(vmf->vma->vm_file); + file_update_time(vma->vm_file); } xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + ret = xfs_vma_checks(vma, inode); + if (ret) + goto out_unlock; + if (IS_DAX(inode)) { ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops); } else { @@ -1044,6 +1079,8 @@ __xfs_filemap_fault( else ret = filemap_fault(vmf); } + +out_unlock: xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); if (write_fault) @@ -1115,6 +1152,31 @@ xfs_filemap_pfn_mkwrite( } +static void +xfs_filemap_direct_open( + struct vm_area_struct *vma) +{ + get_map_direct_vma(vma->vm_private_data); +} + +static void +xfs_filemap_direct_close( + struct vm_area_struct *vma) +{ + put_map_direct_vma(vma->vm_private_data); +} + +static const struct vm_operations_struct xfs_file_vm_direct_ops = { + .fault = xfs_filemap_fault, + .huge_fault = xfs_filemap_huge_fault, + .map_pages = filemap_map_pages, + .page_mkwrite = xfs_filemap_page_mkwrite, + .pfn_mkwrite = xfs_filemap_pfn_mkwrite, + + .open = xfs_filemap_direct_open, + .close = xfs_filemap_direct_close, +}; + static const struct vm_operations_struct xfs_file_vm_ops = { .fault = xfs_filemap_fault, .huge_fault = xfs_filemap_huge_fault, @@ -1135,6 +1197,43 @@ xfs_file_mmap( return 0; } +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT) + +STATIC int +xfs_file_mmap_validate( + struct file *filp, + struct vm_area_struct *vma, + unsigned long map_flags, + int fd) +{ + struct inode *inode = file_inode(filp); + struct xfs_inode *ip = XFS_I(inode); + struct map_direct_state *mds; + + if (map_flags & ~(XFS_MAP_SUPPORTED)) + return -EOPNOTSUPP; + + if ((map_flags & 
MAP_DIRECT) == 0) + return xfs_file_mmap(filp, vma); + + file_accessed(filp); + vma->vm_ops = &xfs_file_vm_direct_ops; + if (IS_DAX(inode)) + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; + + mds = map_direct_register(fd, vma); + if (IS_ERR(mds)) + return PTR_ERR(mds); + + /* flush in-flight faults */ + xfs_ilock(ip, XFS_MMAPLOCK_EXCL); + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL); + + vma->vm_private_data = mds; + + return 0; +} + const struct file_operations xfs_file_operations = { .llseek = xfs_file_llseek, .read_iter = xfs_file_read_iter, @@ -1146,6 +1245,7 @@ const struct file_operations xfs_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .mmap = xfs_file_mmap, + .mmap_validate = xfs_file_mmap_validate, .open = xfs_file_open, .release = xfs_file_release, .fsync = xfs_file_fsync, diff --git a/include/linux/mman.h b/include/linux/mman.h index 94b63b4d71ff..fab393a9dda9 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -20,6 +20,9 @@ #ifndef MAP_HUGE_1GB #define MAP_HUGE_1GB 0 #endif +#ifndef MAP_DIRECT +#define MAP_DIRECT 0 +#endif #ifndef MAP_UNINITIALIZED #define MAP_UNINITIALIZED 0 #endif diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h index 7162cd4cca73..c916f22008e0 100644 --- a/include/uapi/asm-generic/mman.h +++ b/include/uapi/asm-generic/mman.h @@ -12,6 +12,7 @@ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x40000 /* create a huge page mapping */ +#define MAP_DIRECT 0x80000 /* leased block map (layout) for DAX */ /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v7 06/12] xfs: wire up MAP_DIRECT 2017-10-06 22:35 ` [PATCH v7 06/12] xfs: wire up MAP_DIRECT Dan Williams @ 2017-10-09 3:40 ` Dave Chinner 2017-10-09 17:08 ` Dan Williams 0 siblings, 1 reply; 49+ messages in thread From: Dave Chinner @ 2017-10-09 3:40 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, linux-xfs, Jan Kara, Arnd Bergmann, Darrick J. Wong, linux-rdma, linux-api, Christoph Hellwig, J. Bruce Fields, linux-mm, Jeff Moyer, Alexander Viro, linux-fsdevel, Jeff Layton, Ross Zwisler On Fri, Oct 06, 2017 at 03:35:49PM -0700, Dan Williams wrote: > MAP_DIRECT is an mmap(2) flag with the following semantics: > > MAP_DIRECT > When specified with MAP_SHARED_VALIDATE, sets up a file lease with the > same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease > is broken when a "lease breaker" attempts to write(2), change the block > map (fallocate), or change the size of the file. Otherwise the mechanism > of a lease break is identical to the typical lease break case where the > lease needs to be removed (munmap) within the number of seconds > specified by /proc/sys/fs/lease-break-time. If the lease holder fails to > remove the lease in time the kernel will invalidate the mapping and > force all future accesses to the mapping to trigger SIGBUS. > > In addition to lease break timeouts causing faults in the mapping to > result in SIGBUS, other states of the file will trigger SIGBUS at fault > time: > > * The file is not DAX capable > * The file has reflinked (copy-on-write) blocks > * The fault would trigger the filesystem to allocate blocks > * The fault would trigger the filesystem to perform extent conversion > > In other words, MAP_DIRECT expects and enforces a fully allocated file > where faults can be satisfied without modifying block map metadata. > > An unprivileged process may establish a MAP_DIRECT mapping on a file > whose UID (owner) matches the filesystem UID of the process. 
A process > with the CAP_LEASE capability may establish a MAP_DIRECT mapping on > arbitrary files > > ERRORS > EACCES Beyond the typical mmap(2) conditions that trigger EACCES > MAP_DIRECT also requires the permission to set a file lease. > > EOPNOTSUPP The filesystem explicitly does not support the flag > > SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that > might require block-map updates, or the lease timed out and the > kernel invalidated the mapping. > > Cc: Jan Kara <jack@suse.cz> > Cc: Arnd Bergmann <arnd@arndb.de> > Cc: Jeff Moyer <jmoyer@redhat.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > Cc: "Darrick J. Wong" <darrick.wong@oracle.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Cc: Jeff Layton <jlayton@poochiereds.net> > Cc: "J. Bruce Fields" <bfields@fieldses.org> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > fs/xfs/Kconfig | 2 - > fs/xfs/xfs_file.c | 102 +++++++++++++++++++++++++++++++++++++++ > include/linux/mman.h | 3 + > include/uapi/asm-generic/mman.h | 1 > 4 files changed, 106 insertions(+), 2 deletions(-) > > diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig > index f62fc6629abb..f8765653a438 100644 > --- a/fs/xfs/Kconfig > +++ b/fs/xfs/Kconfig > @@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL > > config XFS_LAYOUT > def_bool y > - depends on EXPORTFS_BLOCK_OPS > + depends on EXPORTFS_BLOCK_OPS || FS_DAX > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index ebdd0bd2b261..e35518600e28 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -40,12 +40,22 @@ > #include "xfs_iomap.h" > #include "xfs_reflink.h" > > +#include <linux/mman.h> > #include <linux/dcache.h> > #include <linux/falloc.h> > #include <linux/pagevec.h> > +#include <linux/mapdirect.h> > #include <linux/backing-dev.h> > > static const struct vm_operations_struct xfs_file_vm_ops; > +static const struct vm_operations_struct xfs_file_vm_direct_ops; > 
+ > +static inline bool > +is_xfs_map_direct( > + struct vm_area_struct *vma) > +{ > + return vma->vm_ops == &xfs_file_vm_direct_ops; > +} Namespacing (xfs_vma_is_direct) and whitespace damage. > > /* > * Clear the specified ranges to zero through either the pagecache or DAX. > @@ -1008,6 +1018,26 @@ xfs_file_llseek( > return vfs_setpos(file, offset, inode->i_sb->s_maxbytes); > } > > +static int > +xfs_vma_checks( > + struct vm_area_struct *vma, > + struct inode *inode) Exactly what are we checking for - function name doesn't tell me, and there's no comments, either? > +{ > + if (!is_xfs_map_direct(vma)) > + return 0; > + > + if (!is_map_direct_valid(vma->vm_private_data)) > + return VM_FAULT_SIGBUS; > + > + if (xfs_is_reflink_inode(XFS_I(inode))) > + return VM_FAULT_SIGBUS; > + > + if (!IS_DAX(inode)) > + return VM_FAULT_SIGBUS; And how do we get is_xfs_map_direct() set to true if we don't have a DAX inode or the inode has shared extents? > + > + return 0; > +} > + > /* > * Locking for serialisation of IO during page faults. This results in a lock > * ordering of: > @@ -1024,6 +1054,7 @@ __xfs_filemap_fault( > enum page_entry_size pe_size, > bool write_fault) > { > + struct vm_area_struct *vma = vmf->vma; > struct inode *inode = file_inode(vmf->vma->vm_file); You missed this vmf->vma.... ..... > > +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT) > + > +STATIC int > +xfs_file_mmap_validate( > + struct file *filp, > + struct vm_area_struct *vma, > + unsigned long map_flags, > + int fd) > +{ > + struct inode *inode = file_inode(filp); > + struct xfs_inode *ip = XFS_I(inode); > + struct map_direct_state *mds; > + > + if (map_flags & ~(XFS_MAP_SUPPORTED)) > + return -EOPNOTSUPP; > + > + if ((map_flags & MAP_DIRECT) == 0) > + return xfs_file_mmap(filp, vma); > + > + file_accessed(filp); > + vma->vm_ops = &xfs_file_vm_direct_ops; > + if (IS_DAX(inode)) > + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; And if it isn't a DAX inode? 
what is MAP_DIRECT supposed to do then? > + mds = map_direct_register(fd, vma); > + if (IS_ERR(mds)) > + return PTR_ERR(mds); > + > + /* flush in-flight faults */ > + xfs_ilock(ip, XFS_MMAPLOCK_EXCL); > + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL); Urk. That's nasty. And why is it even necessary? Please explain why this is necessary in the comment, because it's not at all obvious to me... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 06/12] xfs: wire up MAP_DIRECT 2017-10-09 3:40 ` Dave Chinner @ 2017-10-09 17:08 ` Dan Williams 2017-10-09 22:50 ` Dave Chinner 0 siblings, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-09 17:08 UTC (permalink / raw) To: Dave Chinner Cc: J. Bruce Fields, Jan Kara, Arnd Bergmann, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux API, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Alexander Viro, linux-fsdevel, Jeff Layton, Christoph Hellwig On Sun, Oct 8, 2017 at 8:40 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote: Thanks for the review Dave. > On Fri, Oct 06, 2017 at 03:35:49PM -0700, Dan Williams wrote: >> MAP_DIRECT is an mmap(2) flag with the following semantics: >> >> MAP_DIRECT >> When specified with MAP_SHARED_VALIDATE, sets up a file lease with the >> same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease >> is broken when a "lease breaker" attempts to write(2), change the block >> map (fallocate), or change the size of the file. Otherwise the mechanism >> of a lease break is identical to the typical lease break case where the >> lease needs to be removed (munmap) within the number of seconds >> specified by /proc/sys/fs/lease-break-time. If the lease holder fails to >> remove the lease in time the kernel will invalidate the mapping and >> force all future accesses to the mapping to trigger SIGBUS. >> >> In addition to lease break timeouts causing faults in the mapping to >> result in SIGBUS, other states of the file will trigger SIGBUS at fault >> time: >> >> * The file is not DAX capable >> * The file has reflinked (copy-on-write) blocks >> * The fault would trigger the filesystem to allocate blocks >> * The fault would trigger the filesystem to perform extent conversion >> >> In other words, MAP_DIRECT expects and enforces a fully allocated file >> where faults can be satisfied without modifying block map metadata. 
>> >> An unprivileged process may establish a MAP_DIRECT mapping on a file >> whose UID (owner) matches the filesystem UID of the process. A process >> with the CAP_LEASE capability may establish a MAP_DIRECT mapping on >> arbitrary files >> >> ERRORS >> EACCES Beyond the typical mmap(2) conditions that trigger EACCES >> MAP_DIRECT also requires the permission to set a file lease. >> >> EOPNOTSUPP The filesystem explicitly does not support the flag >> >> SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that >> might require block-map updates, or the lease timed out and the >> kernel invalidated the mapping. >> >> Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> >> Cc: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> >> Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> >> Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> >> Cc: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> >> Cc: "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> >> Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> >> Cc: Jeff Layton <jlayton-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org> >> Cc: "J. 
Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> >> Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> >> --- >> fs/xfs/Kconfig | 2 - >> fs/xfs/xfs_file.c | 102 +++++++++++++++++++++++++++++++++++++++ >> include/linux/mman.h | 3 + >> include/uapi/asm-generic/mman.h | 1 >> 4 files changed, 106 insertions(+), 2 deletions(-) >> >> diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig >> index f62fc6629abb..f8765653a438 100644 >> --- a/fs/xfs/Kconfig >> +++ b/fs/xfs/Kconfig >> @@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL >> >> config XFS_LAYOUT >> def_bool y >> - depends on EXPORTFS_BLOCK_OPS >> + depends on EXPORTFS_BLOCK_OPS || FS_DAX >> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c >> index ebdd0bd2b261..e35518600e28 100644 >> --- a/fs/xfs/xfs_file.c >> +++ b/fs/xfs/xfs_file.c >> @@ -40,12 +40,22 @@ >> #include "xfs_iomap.h" >> #include "xfs_reflink.h" >> >> +#include <linux/mman.h> >> #include <linux/dcache.h> >> #include <linux/falloc.h> >> #include <linux/pagevec.h> >> +#include <linux/mapdirect.h> >> #include <linux/backing-dev.h> >> >> static const struct vm_operations_struct xfs_file_vm_ops; >> +static const struct vm_operations_struct xfs_file_vm_direct_ops; >> + >> +static inline bool >> +is_xfs_map_direct( >> + struct vm_area_struct *vma) >> +{ >> + return vma->vm_ops == &xfs_file_vm_direct_ops; >> +} > > Namespacing (xfs_vma_is_direct) and whitespace damage. Will fix. > >> >> /* >> * Clear the specified ranges to zero through either the pagecache or DAX. >> @@ -1008,6 +1018,26 @@ xfs_file_llseek( >> return vfs_setpos(file, offset, inode->i_sb->s_maxbytes); >> } >> >> +static int >> +xfs_vma_checks( >> + struct vm_area_struct *vma, >> + struct inode *inode) > > Exactly what are we checking for - function name doesn't tell me, > and there's no comments, either? Ok, I'll improve this. 
> >> +{ >> + if (!is_xfs_map_direct(vma)) >> + return 0; >> + >> + if (!is_map_direct_valid(vma->vm_private_data)) >> + return VM_FAULT_SIGBUS; >> + >> + if (xfs_is_reflink_inode(XFS_I(inode))) >> + return VM_FAULT_SIGBUS; >> + >> + if (!IS_DAX(inode)) >> + return VM_FAULT_SIGBUS; > > And how do we get is_xfs_map_direct() set to true if we don't have a > DAX inode or the inode has shared extents? So, this was my way of trying to satisfy the request you made here: https://lkml.org/lkml/2017/8/11/876 i.e. allow MAP_DIRECT on non-dax files to enable a use case of freezing the block-map to examine which file extents are linked. If you don't want to use MAP_DIRECT for this, we can move these checks to mmap time. > >> + >> + return 0; >> +} >> + >> /* >> * Locking for serialisation of IO during page faults. This results in a lock >> * ordering of: >> @@ -1024,6 +1054,7 @@ __xfs_filemap_fault( >> enum page_entry_size pe_size, >> bool write_fault) >> { >> + struct vm_area_struct *vma = vmf->vma; >> struct inode *inode = file_inode(vmf->vma->vm_file); > > You missed this vmf->vma.... > > ..... >> >> +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT) >> + >> +STATIC int >> +xfs_file_mmap_validate( >> + struct file *filp, >> + struct vm_area_struct *vma, >> + unsigned long map_flags, >> + int fd) >> +{ >> + struct inode *inode = file_inode(filp); >> + struct xfs_inode *ip = XFS_I(inode); >> + struct map_direct_state *mds; >> + >> + if (map_flags & ~(XFS_MAP_SUPPORTED)) >> + return -EOPNOTSUPP; >> + >> + if ((map_flags & MAP_DIRECT) == 0) >> + return xfs_file_mmap(filp, vma); >> + >> + file_accessed(filp); >> + vma->vm_ops = &xfs_file_vm_direct_ops; >> + if (IS_DAX(inode)) >> + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; > > And if it isn't a DAX inode? what is MAP_DIRECT supposed to do then? In the non-DAX case it just takes the FL_LAYOUT file lease... although we could also just have an fcntl for that purpose. 
The use case of just freezing the block map does not need a mapping. >> + mds = map_direct_register(fd, vma); >> + if (IS_ERR(mds)) >> + return PTR_ERR(mds); >> + >> + /* flush in-flight faults */ >> + xfs_ilock(ip, XFS_MMAPLOCK_EXCL); >> + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL); > > Urk. That's nasty. And why is it even necessary? Please explain why > this is necessary in the comment, because it's not at all obvious to > me... This is related to your other observation about i_mapdcount and adding an iomap_can_allocate() helper. I think I can clean both of these up by using a call to break_layout(inode, false) and bailing in ->iomap_begin() if it returns EWOULDBLOCK. This would also fix the current problem that allocating write-faults don't start the lease break process. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 06/12] xfs: wire up MAP_DIRECT 2017-10-09 17:08 ` Dan Williams @ 2017-10-09 22:50 ` Dave Chinner 0 siblings, 0 replies; 49+ messages in thread From: Dave Chinner @ 2017-10-09 22:50 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm@lists.01.org, linux-xfs, Jan Kara, Arnd Bergmann, Darrick J. Wong, linux-rdma, Linux API, Christoph Hellwig, J. Bruce Fields, Linux MM, Jeff Moyer, Alexander Viro, linux-fsdevel, Jeff Layton, Ross Zwisler On Mon, Oct 09, 2017 at 10:08:40AM -0700, Dan Williams wrote: > On Sun, Oct 8, 2017 at 8:40 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > >> /* > >> * Clear the specified ranges to zero through either the pagecache or DAX. > >> @@ -1008,6 +1018,26 @@ xfs_file_llseek( > >> return vfs_setpos(file, offset, inode->i_sb->s_maxbytes); > >> } > >> > >> +static int > >> +xfs_vma_checks( > >> + struct vm_area_struct *vma, > >> + struct inode *inode) > > > > Exactly what are we checking for - function name doesn't tell me, > > and there's no comments, either? > > Ok, I'll improve this. > > > > >> +{ > >> + if (!is_xfs_map_direct(vma)) > >> + return 0; > >> + > >> + if (!is_map_direct_valid(vma->vm_private_data)) > >> + return VM_FAULT_SIGBUS; > >> + > >> + if (xfs_is_reflink_inode(XFS_I(inode))) > >> + return VM_FAULT_SIGBUS; > >> + > >> + if (!IS_DAX(inode)) > >> + return VM_FAULT_SIGBUS; > > > > And how do we get is_xfs_map_direct() set to true if we don't have a > > DAX inode or the inode has shared extents? > > So, this was my way of trying to satisfy the request you made here: > > https://lkml.org/lkml/2017/8/11/876 > > i.e. allow MAP_DIRECT on non-dax files to enable a use case of > freezing the block-map to examine which file extents are linked. If > you don't want to use MAP_DIRECT for this, we can move these checks to > mmap time. Ok, but I don't want to use mmap to deal with this, nor do I care whether DAX is in use or not. So I don't think this is really necessary for MAP_DIRECT. 
> >> +xfs_file_mmap_validate( > >> + struct file *filp, > >> + struct vm_area_struct *vma, > >> + unsigned long map_flags, > >> + int fd) > >> +{ > >> + struct inode *inode = file_inode(filp); > >> + struct xfs_inode *ip = XFS_I(inode); > >> + struct map_direct_state *mds; > >> + > >> + if (map_flags & ~(XFS_MAP_SUPPORTED)) > >> + return -EOPNOTSUPP; > >> + > >> + if ((map_flags & MAP_DIRECT) == 0) > >> + return xfs_file_mmap(filp, vma); > >> + > >> + file_accessed(filp); > >> + vma->vm_ops = &xfs_file_vm_direct_ops; > >> + if (IS_DAX(inode)) > >> + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; > > > > And if it isn't a DAX inode? what is MAP_DIRECT supposed to do then? > > In the non-DAX case it just takes the FL_LAYOUT file lease... although > we could also just have an fcntl for that purpose. The use case of > just freezing the block map does not need a mapping. RIght, so I think we should just add a fcntl for the non-DAX case I have in mind, and not complicate the MAP_DIRECT implementation right now. We can alsways extend the scope of MAP_DIRECT in future if we actually need to do so. > >> + mds = map_direct_register(fd, vma); > >> + if (IS_ERR(mds)) > >> + return PTR_ERR(mds); > >> + > >> + /* flush in-flight faults */ > >> + xfs_ilock(ip, XFS_MMAPLOCK_EXCL); > >> + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL); > > > > Urk. That's nasty. And why is it even necessary? Please explain why > > this is necessary in the comment, because it's not at all obvious to > > me... > > This is related to your other observation about i_mapdcount and adding > an iomap_can_allocate() helper. I think I can clean both of these up > by using a call to break_layout(inode, false) and bailing in > ->iomap_begin() if it returns EWOULDBLOCK. This would also fix the > current problem that allocating write-faults don't start the lease > break process. OK. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-06 22:35 [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush Dan Williams ` (2 preceding siblings ...) 2017-10-06 22:35 ` [PATCH v7 06/12] xfs: wire up MAP_DIRECT Dan Williams @ 2017-10-06 22:35 ` Dan Williams 2017-10-06 22:45 ` David Woodhouse ` (2 more replies) [not found] ` <150732931273.22363.8436792888326501071.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> ` (3 subsequent siblings) 7 siblings, 3 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw) To: linux-nvdimm Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, linux-mm, Jeff Moyer, linux-api, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski Add a helper to determine if the dma mappings set up for a given device are backed by an iommu. In particular, this lets code paths know that a dma_unmap operation will revoke access to memory if the device can not otherwise be quiesced. The need for this knowledge is driven by a need to make RDMA transfers to DAX mappings safe. If the DAX file's block map changes we need to be to reliably stop accesses to blocks that have been freed or re-assigned to a new file. Since PMEM+DAX is currently only enabled for x86, we only update the x86 iommu drivers. Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Ashok Raj <ashok.raj@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: "Darrick J. 
Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/base/dma-mapping.c | 10 ++++++++++ drivers/iommu/amd_iommu.c | 6 ++++++ drivers/iommu/intel-iommu.c | 6 ++++++ include/linux/dma-mapping.h | 3 +++ 4 files changed, 25 insertions(+) diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c index e584eddef0a7..e1b5f103d90e 100644 --- a/drivers/base/dma-mapping.c +++ b/drivers/base/dma-mapping.c @@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev) of_dma_deconfigure(dev); acpi_dma_deconfigure(dev); } + +bool dma_has_iommu(struct device *dev) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + + if (ops && ops->has_iommu) + return ops->has_iommu(dev); + return false; +} +EXPORT_SYMBOL(dma_has_iommu); diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 51f8215877f5..873f899fcf57 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -2271,6 +2271,11 @@ static struct protection_domain *get_domain(struct device *dev) return domain; } +static bool amd_dma_has_iommu(struct device *dev) +{ + return !IS_ERR(get_domain(dev)); +} + static void update_device_table(struct protection_domain *domain) { struct iommu_dev_data *dev_data; @@ -2689,6 +2694,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = { .unmap_sg = unmap_sg, .dma_supported = amd_iommu_dma_supported, .mapping_error = amd_iommu_mapping_error, + .has_iommu = amd_dma_has_iommu, }; static int init_reserved_iova_ranges(void) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 6784a05dd6b2..243ef42fdad4 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -3578,6 +3578,11 @@ static int iommu_no_mapping(struct device *dev) return 0; } +static bool intel_dma_has_iommu(struct device *dev) +{ + return !iommu_no_mapping(dev); +} + static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr, size_t 
size, int dir, u64 dma_mask) { @@ -3872,6 +3877,7 @@ const struct dma_map_ops intel_dma_ops = { .map_page = intel_map_page, .unmap_page = intel_unmap_page, .mapping_error = intel_mapping_error, + .has_iommu = intel_dma_has_iommu, #ifdef CONFIG_X86 .dma_supported = x86_dma_supported, #endif diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 29ce9815da87..659f122c18f5 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -128,6 +128,7 @@ struct dma_map_ops { enum dma_data_direction dir); int (*mapping_error)(struct device *dev, dma_addr_t dma_addr); int (*dma_supported)(struct device *dev, u64 mask); + bool (*has_iommu)(struct device *dev); #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK u64 (*get_required_mask)(struct device *dev); #endif @@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) } #endif +extern bool dma_has_iommu(struct device *dev); + static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, size_t size, enum dma_data_direction dir, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-06 22:35 ` [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() Dan Williams @ 2017-10-06 22:45 ` David Woodhouse [not found] ` <1507329939.29211.434.camel-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2017-10-08 3:45 ` [PATCH v8] dma-mapping: introduce dma_get_iommu_domain() Dan Williams [not found] ` <150732935473.22363.1853399637339625023.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2 siblings, 1 reply; 49+ messages in thread From: David Woodhouse @ 2017-10-06 22:45 UTC (permalink / raw) To: Dan Williams, linux-nvdimm Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, linux-mm, Jeff Moyer, linux-api, linux-fsdevel, Ross Zwisler, Robin Murphy, Christoph Hellwig, Marek Szyprowski [-- Attachment #1: Type: text/plain, Size: 685 bytes --] On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote: > Add a helper to determine if the dma mappings set up for a given device > are backed by an iommu. In particular, this lets code paths know that a > dma_unmap operation will revoke access to memory if the device can not > otherwise be quiesced. The need for this knowledge is driven by a need > to make RDMA transfers to DAX mappings safe. If the DAX file's block map > changes we need to be to reliably stop accesses to blocks that have been > freed or re-assigned to a new file. "a dma_unmap operation revoke access to memory"... but it's OK that the next *map* will give the same DMA address to someone else, right? [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 4938 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <1507329939.29211.434.camel-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2017-10-06 22:52 ` Dan Williams [not found] ` <CAPcyv4jLj2WOgPA+B5eYME0uZyuoy3gp35w+4rCa_EZxm-QSKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-10-06 23:12 ` Dan Williams 0 siblings, 2 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:52 UTC (permalink / raw) To: David Woodhouse Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Linux API, linux-fsdevel, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote: >> Add a helper to determine if the dma mappings set up for a given device >> are backed by an iommu. In particular, this lets code paths know that a >> dma_unmap operation will revoke access to memory if the device can not >> otherwise be quiesced. The need for this knowledge is driven by a need >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map >> changes we need to be to reliably stop accesses to blocks that have been >> freed or re-assigned to a new file. > > "a dma_unmap operation revoke access to memory"... but it's OK that the > next *map* will give the same DMA address to someone else, right? I'm assuming the next map will be to other physical addresses and a different requester device since the memory is still registered exclusively. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <CAPcyv4jLj2WOgPA+B5eYME0uZyuoy3gp35w+4rCa_EZxm-QSKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-10-06 23:10 ` David Woodhouse [not found] ` <1507331434.29211.439.camel-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: David Woodhouse @ 2017-10-06 23:10 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, Robin Murphy, Christoph Hellwig, Marek Szyprowski [-- Attachment #1: Type: text/plain, Size: 1477 bytes --] On Fri, 2017-10-06 at 15:52 -0700, Dan Williams wrote: > On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > > > On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote: > > > > > > Add a helper to determine if the dma mappings set up for a given device > > > are backed by an iommu. In particular, this lets code paths know that a > > > dma_unmap operation will revoke access to memory if the device can not > > > otherwise be quiesced. The need for this knowledge is driven by a need > > > to make RDMA transfers to DAX mappings safe. If the DAX file's block map > > > changes we need to be to reliably stop accesses to blocks that have been > > > freed or re-assigned to a new file. > > "a dma_unmap operation revoke access to memory"... but it's OK that the > > next *map* will give the same DMA address to someone else, right? > > I'm assuming the next map will be to other physical addresses and a > different requester device since the memory is still registered > exclusively. I meant the next map for this device/group. 
It may well use the same virtual DMA address as the one you just unmapped, yet actually map to a different physical address. So if the DMA still occurs to the "old" address, that isn't revoked at all — it's just going to the wrong physical location. And if you are sure that the DMA will never happen, why do you need to revoke the mapping in the first place? [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 4938 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <1507331434.29211.439.camel-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2017-10-06 23:15 ` Dan Williams 2017-10-07 11:08 ` David Woodhouse 0 siblings, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-06 23:15 UTC (permalink / raw) To: David Woodhouse Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Linux API, linux-fsdevel, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Fri, Oct 6, 2017 at 4:10 PM, David Woodhouse <dwmw2@infradead.org> wrote: > On Fri, 2017-10-06 at 15:52 -0700, Dan Williams wrote: >> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote: >> > >> > On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote: >> > > >> > > Add a helper to determine if the dma mappings set up for a given device >> > > are backed by an iommu. In particular, this lets code paths know that a >> > > dma_unmap operation will revoke access to memory if the device can not >> > > otherwise be quiesced. The need for this knowledge is driven by a need >> > > to make RDMA transfers to DAX mappings safe. If the DAX file's block map >> > > changes we need to be to reliably stop accesses to blocks that have been >> > > freed or re-assigned to a new file. >> > "a dma_unmap operation revoke access to memory"... but it's OK that the >> > next *map* will give the same DMA address to someone else, right? >> >> I'm assuming the next map will be to other physical addresses and a >> different requester device since the memory is still registered >> exclusively. > > I meant the next map for this device/group. > > It may well use the same virtual DMA address as the one you just > unmapped, yet actually map to a different physical address. 
So if the > DMA still occurs to the "old" address, that isn't revoked at all — it's > just going to the wrong physical location. > > And if you are sure that the DMA will never happen, why do you need to > revoke the mapping in the first place? Right, crossed mails. The semantic I want is that the IOVA is invalidated / starts throwing errors to the device because the address it thought it was talking to has been remapped in the file. Once userspace wakes up and responds to this invalidation event it can do the actual unmap to make the IOVA reusable again. _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-06 23:15 ` Dan Williams @ 2017-10-07 11:08 ` David Woodhouse [not found] ` <1507374524.25529.13.camel-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: David Woodhouse @ 2017-10-07 11:08 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm@lists.01.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, Robin Murphy, Christoph Hellwig, Marek Szyprowski [-- Attachment #1: Type: text/plain, Size: 808 bytes --] On Fri, 2017-10-06 at 16:15 -0700, Dan Williams wrote: > > Right, crossed mails. The semantic I want is that the IOVA is > invalidated / starts throwing errors to the device because the address > it thought it was talking to has been remapped in the file. Once > userspace wakes up and responds to this invalidation event it can do > the actual unmap to make the IOVA reusable again. So basically you want to unmap it by removing it from the page tables and flushing the IOTLB, but you want the IOVA to still be reserved. The normal device-facing DMA API doesn't give you that today. You could do it with the IOMMU API though — that one does let you manage the IOVA space yourself. You don't want that IOVA used again? Well don't use it as the IOVA in a subsequent iommu_map() call then :) [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 4938 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <1507374524.25529.13.camel-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2017-10-07 23:33 ` Dan Williams 0 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-07 23:33 UTC (permalink / raw) To: David Woodhouse Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Sat, Oct 7, 2017 at 4:08 AM, David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > On Fri, 2017-10-06 at 16:15 -0700, Dan Williams wrote: >> >> Right, crossed mails. The semantic I want is that the IOVA is >> invalidated / starts throwing errors to the device because the address >> it thought it was talking to has been remapped in the file. Once >> userspace wakes up and responds to this invalidation event it can do >> the actual unmap to make the IOVA reusable again. > > So basically you want to unmap it by removing it from the page tables > and flushing the IOTLB, but you want the IOVA to still be reserved. > > The normal device-facing DMA API doesn't give you that today. You could > do it with the IOMMU API though — that one does let you manage the IOVA > space yourself. You don't want that IOVA used again? Well don't use it > as the IOVA in a subsequent iommu_map() call then :) Ah, nice. So I think I'll just add a dma_get_iommu_domain() so the dma_ops implementation can exclude identity-mapped devices, and then iommu_unmap() does the rest. Thanks for the pointer. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-06 22:52 ` Dan Williams [not found] ` <CAPcyv4jLj2WOgPA+B5eYME0uZyuoy3gp35w+4rCa_EZxm-QSKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-10-06 23:12 ` Dan Williams 1 sibling, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 23:12 UTC (permalink / raw) To: David Woodhouse Cc: linux-nvdimm@lists.01.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Fri, Oct 6, 2017 at 3:52 PM, Dan Williams <dan.j.williams@intel.com> wrote: > On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote: >> On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote: >>> Add a helper to determine if the dma mappings set up for a given device >>> are backed by an iommu. In particular, this lets code paths know that a >>> dma_unmap operation will revoke access to memory if the device can not >>> otherwise be quiesced. The need for this knowledge is driven by a need >>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map >>> changes we need to be to reliably stop accesses to blocks that have been >>> freed or re-assigned to a new file. >> >> "a dma_unmap operation revoke access to memory"... but it's OK that the >> next *map* will give the same DMA address to someone else, right? > > I'm assuming the next map will be to other physical addresses and a > different requester device since the memory is still registered > exclusively. [ chatted with Ashok ] Yes, it seems we need a way to pin that IOVA as in use, but invalidate it. Then we can wait for the unmap to occur when the memory is unregistered to avoid this IOVA reuse problem. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v8] dma-mapping: introduce dma_get_iommu_domain() 2017-10-06 22:35 ` [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() Dan Williams 2017-10-06 22:45 ` David Woodhouse @ 2017-10-08 3:45 ` Dan Williams [not found] ` <150743420333.12880.6968831423519457797.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2017-10-10 14:40 ` Raj, Ashok [not found] ` <150732935473.22363.1853399637339625023.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2 siblings, 2 replies; 49+ messages in thread From: Dan Williams @ 2017-10-08 3:45 UTC (permalink / raw) To: linux-nvdimm Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, linux-mm, Jeff Moyer, linux-api, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski Add a dma-mapping api helper to retrieve the generic iommu_domain for a device. The motivation for this interface is making RDMA transfers to DAX mappings safe. If the DAX file's block map changes we need to be to reliably stop accesses to blocks that have been freed or re-assigned to a new file. With the iommu_domain and a callback from the DAX filesystem the kernel can safely revoke access to a DMA device. The process that performed the RDMA memory registration is also notified of this revocation event, but the kernel can not otherwise be in the position of waiting for userspace to quiesce the device. Since PMEM+DAX is currently only enabled for x86, we only update the x86 iommu drivers. Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Ashok Raj <ashok.raj@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: "Darrick J. 
Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- Changes since v7: * retrieve the iommu_domain so that we can later pass the results of dma_map_* to iommu_unmap() in advance of the actual dma_unmap_*. drivers/base/dma-mapping.c | 10 ++++++++++ drivers/iommu/amd_iommu.c | 10 ++++++++++ drivers/iommu/intel-iommu.c | 15 +++++++++++++++ include/linux/dma-mapping.h | 3 +++ 4 files changed, 38 insertions(+) diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c index e584eddef0a7..fdb9764f95a4 100644 --- a/drivers/base/dma-mapping.c +++ b/drivers/base/dma-mapping.c @@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev) of_dma_deconfigure(dev); acpi_dma_deconfigure(dev); } + +struct iommu_domain *dma_get_iommu_domain(struct device *dev) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + + if (ops && ops->get_iommu) + return ops->get_iommu(dev); + return NULL; +} +EXPORT_SYMBOL(dma_get_iommu_domain); diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 51f8215877f5..c8e1a45af182 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev) return domain; } +static struct iommu_domain *amd_dma_get_iommu(struct device *dev) +{ + struct protection_domain *domain = get_domain(dev); + + if (IS_ERR(domain)) + return NULL; + return &domain->domain; +} + static void update_device_table(struct protection_domain *domain) { struct iommu_dev_data *dev_data; @@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = { .unmap_sg = unmap_sg, .dma_supported = amd_iommu_dma_supported, .mapping_error = amd_iommu_mapping_error, + .get_iommu = amd_dma_get_iommu, }; static int init_reserved_iova_ranges(void) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 6784a05dd6b2..f3f4939cebad 100644 --- a/drivers/iommu/intel-iommu.c 
+++ b/drivers/iommu/intel-iommu.c @@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev) return 0; } +static struct iommu_domain *intel_dma_get_iommu(struct device *dev) +{ + struct dmar_domain *domain; + + if (iommu_no_mapping(dev)) + return NULL; + + domain = get_valid_domain_for_dev(dev); + if (!domain) + return NULL; + + return &domain->domain; +} + static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr, size_t size, int dir, u64 dma_mask) { @@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = { .map_page = intel_map_page, .unmap_page = intel_unmap_page, .mapping_error = intel_mapping_error, + .get_iommu = intel_dma_get_iommu, #ifdef CONFIG_X86 .dma_supported = x86_dma_supported, #endif diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 29ce9815da87..aa62df1d0d72 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -128,6 +128,7 @@ struct dma_map_ops { enum dma_data_direction dir); int (*mapping_error)(struct device *dev, dma_addr_t dma_addr); int (*dma_supported)(struct device *dev, u64 mask); + struct iommu_domain *(*get_iommu)(struct device *dev); #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK u64 (*get_required_mask)(struct device *dev); #endif @@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) } #endif +extern struct iommu_domain *dma_get_iommu_domain(struct device *dev); + static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, size_t size, enum dma_data_direction dir, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 49+ messages in thread
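As a rough userspace sketch of the dispatch pattern the patch adds — `dma_get_iommu_domain()` asks the device's `dma_map_ops` for its IOMMU domain and returns NULL when no IOMMU is translating DMA for the device. All struct definitions below are hypothetical stand-ins for the kernel types, not real kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structs; this only
 * models the ops-table dispatch, not real DMA-mapping setup. */
struct device;

struct iommu_domain {
	int id;
};

struct dma_map_ops {
	struct iommu_domain *(*get_iommu)(struct device *dev);
};

struct device {
	const struct dma_map_ops *dma_ops;
	struct iommu_domain *domain;
};

static const struct dma_map_ops *get_dma_ops(struct device *dev)
{
	return dev->dma_ops;
}

/* Same shape as the patch's dma_get_iommu_domain(): a NULL return means
 * "no IOMMU-backed dma_map_ops for this device", so the caller cannot
 * rely on IOMMU-based revocation and must fall back. */
static struct iommu_domain *dma_get_iommu_domain(struct device *dev)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (ops && ops->get_iommu)
		return ops->get_iommu(dev);
	return NULL;
}

/* Models a driver-provided ->get_iommu() like amd_dma_get_iommu() */
static struct iommu_domain *stub_get_iommu(struct device *dev)
{
	return dev->domain;
}
```

The key design point the thread revolves around: a driver only populates `->get_iommu` when the returned domain is actually the one servicing `dma_map_*()` for that device.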
* Re: [PATCH v8] dma-mapping: introduce dma_get_iommu_domain() [not found] ` <150743420333.12880.6968831423519457797.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-09 10:37 ` Robin Murphy 2017-10-09 17:32 ` Dan Williams 0 siblings, 1 reply; 49+ messages in thread From: Robin Murphy @ 2017-10-09 10:37 UTC (permalink / raw) To: Dan Williams, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Ross Zwisler, David Woodhouse, Christoph Hellwig, Marek Szyprowski Hi Dan, On 08/10/17 04:45, Dan Williams wrote: > Add a dma-mapping api helper to retrieve the generic iommu_domain for a device. > The motivation for this interface is making RDMA transfers to DAX mappings > safe. If the DAX file's block map changes we need to be to reliably stop > accesses to blocks that have been freed or re-assigned to a new file. ...which is also going to require some way to force the IOMMU drivers (on x86 at least) to do a fully-synchronous unmap, instead of just throwing the IOVA onto a flush queue to invalidate the TLBs at some point in the future. Assuming of course that there's an IOMMU both present and performing DMA translation in the first place. > With the > iommu_domain and a callback from the DAX filesystem the kernel can safely > revoke access to a DMA device. The process that performed the RDMA memory > registration is also notified of this revocation event, but the kernel can not > otherwise be in the position of waiting for userspace to quiesce the device. OK, but why reinvent iommu_get_domain_for_dev()? > Since PMEM+DAX is currently only enabled for x86, we only update the x86 > iommu drivers. 
Note in particular that those two drivers happen to be the *only* place this approach could work - everyone else is going to have to fall back to the generic IOMMU API function anyway. Robin. > Cc: Marek Szyprowski <m.szyprowski-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org> > Cc: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org> > Cc: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> > Cc: Joerg Roedel <joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org> > Cc: David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> > Cc: Ashok Raj <ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> > Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> > Cc: "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > --- > Changes since v7: > * retrieve the iommu_domain so that we can later pass the results of > dma_map_* to iommu_unmap() in advance of the actual dma_unmap_*. 
> > drivers/base/dma-mapping.c | 10 ++++++++++ > drivers/iommu/amd_iommu.c | 10 ++++++++++ > drivers/iommu/intel-iommu.c | 15 +++++++++++++++ > include/linux/dma-mapping.h | 3 +++ > 4 files changed, 38 insertions(+) > > diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c > index e584eddef0a7..fdb9764f95a4 100644 > --- a/drivers/base/dma-mapping.c > +++ b/drivers/base/dma-mapping.c > @@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev) > of_dma_deconfigure(dev); > acpi_dma_deconfigure(dev); > } > + > +struct iommu_domain *dma_get_iommu_domain(struct device *dev) > +{ > + const struct dma_map_ops *ops = get_dma_ops(dev); > + > + if (ops && ops->get_iommu) > + return ops->get_iommu(dev); > + return NULL; > +} > +EXPORT_SYMBOL(dma_get_iommu_domain); > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 51f8215877f5..c8e1a45af182 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev) > return domain; > } > > +static struct iommu_domain *amd_dma_get_iommu(struct device *dev) > +{ > + struct protection_domain *domain = get_domain(dev); > + > + if (IS_ERR(domain)) > + return NULL; > + return &domain->domain; > +} > + > static void update_device_table(struct protection_domain *domain) > { > struct iommu_dev_data *dev_data; > @@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = { > .unmap_sg = unmap_sg, > .dma_supported = amd_iommu_dma_supported, > .mapping_error = amd_iommu_mapping_error, > + .get_iommu = amd_dma_get_iommu, > }; > > static int init_reserved_iova_ranges(void) > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c > index 6784a05dd6b2..f3f4939cebad 100644 > --- a/drivers/iommu/intel-iommu.c > +++ b/drivers/iommu/intel-iommu.c > @@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev) > return 0; > } > > +static struct iommu_domain 
*intel_dma_get_iommu(struct device *dev) > +{ > + struct dmar_domain *domain; > + > + if (iommu_no_mapping(dev)) > + return NULL; > + > + domain = get_valid_domain_for_dev(dev); > + if (!domain) > + return NULL; > + > + return &domain->domain; > +} > + > static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr, > size_t size, int dir, u64 dma_mask) > { > @@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = { > .map_page = intel_map_page, > .unmap_page = intel_unmap_page, > .mapping_error = intel_mapping_error, > + .get_iommu = intel_dma_get_iommu, > #ifdef CONFIG_X86 > .dma_supported = x86_dma_supported, > #endif > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h > index 29ce9815da87..aa62df1d0d72 100644 > --- a/include/linux/dma-mapping.h > +++ b/include/linux/dma-mapping.h > @@ -128,6 +128,7 @@ struct dma_map_ops { > enum dma_data_direction dir); > int (*mapping_error)(struct device *dev, dma_addr_t dma_addr); > int (*dma_supported)(struct device *dev, u64 mask); > + struct iommu_domain *(*get_iommu)(struct device *dev); > #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK > u64 (*get_required_mask)(struct device *dev); > #endif > @@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) > } > #endif > > +extern struct iommu_domain *dma_get_iommu_domain(struct device *dev); > + > static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, > size_t size, > enum dma_data_direction dir, > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v8] dma-mapping: introduce dma_get_iommu_domain() 2017-10-09 10:37 ` Robin Murphy @ 2017-10-09 17:32 ` Dan Williams 0 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-09 17:32 UTC (permalink / raw) To: Robin Murphy Cc: linux-nvdimm@lists.01.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Christoph Hellwig, Marek Szyprowski On Mon, Oct 9, 2017 at 3:37 AM, Robin Murphy <robin.murphy@arm.com> wrote: > Hi Dan, > > On 08/10/17 04:45, Dan Williams wrote: >> Add a dma-mapping api helper to retrieve the generic iommu_domain for a device. >> The motivation for this interface is making RDMA transfers to DAX mappings >> safe. If the DAX file's block map changes we need to be to reliably stop >> accesses to blocks that have been freed or re-assigned to a new file. > > ...which is also going to require some way to force the IOMMU drivers > (on x86 at least) to do a fully-synchronous unmap, instead of just > throwing the IOVA onto a flush queue to invalidate the TLBs at some > point in the future. Isn't that the difference between iommu_unmap() and iommu_unmap_fast()? As far as I can tell amd-iommu and intel-iommu both flush iotlbs on iommu_unmap() and don't support fast unmaps. > Assuming of course that there's an IOMMU both > present and performing DMA translation in the first place. That's why I want to call through the dma api to see if the iommu is being used to satisfy dma mappings. >> With the >> iommu_domain and a callback from the DAX filesystem the kernel can safely >> revoke access to a DMA device. The process that performed the RDMA memory >> registration is also notified of this revocation event, but the kernel can not >> otherwise be in the position of waiting for userspace to quiesce the device. > > OK, but why reinvent iommu_get_domain_for_dev()? 
How do I know if the iommu returned from that routine is the one being used for dma mapping operations for the device? Specifically, how would I discover that the result of dma_map_sg() can be passed as an IOVA range to iommu_unmap()? >> Since PMEM+DAX is currently only enabled for x86, we only update the x86 >> iommu drivers. > > Note in particular that those two drivers happen to be the *only* place > this approach could work - everyone else is going to have to fall back > to the generic IOMMU API function anyway. I want to make this functionality generic, but I'm not familiar with the iommu sub-system. How are dma mapping operations routed to the iommu driver in those other implementations?
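[Editor's sketch] The revocation sequence Dan describes can be modeled as a small IOVA lifecycle (all names hypothetical, not kernel APIs): a lease break unmaps the range from the IOMMU so the device faults instead of touching reused blocks, but the IOVA stays reserved until the memory registration is torn down, so the addresses cannot be handed to another mapping in the meantime:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one registered IOVA range; both fields are a deliberate
 * simplification of the real IOVA allocator state. */
struct iova_range {
	bool reserved;	/* addresses allocated to this registration */
	bool mapped;	/* translations live in the IOMMU */
};

/* dma_map_sg() analogue: allocate the range and install translations */
static void model_dma_map_sg(struct iova_range *r)
{
	r->reserved = true;
	r->mapped = true;
}

/* lease break: revoke device access without recycling the addresses */
static void model_iommu_unmap(struct iova_range *r)
{
	r->mapped = false;
}

/* memory un-registration: only now may the range be handed out again */
static void model_dma_unmap_sg(struct iova_range *r)
{
	r->mapped = false;
	r->reserved = false;
}
```

The invariant being argued for: between `iommu_unmap()` and `dma_unmap_sg()` the range is unmapped yet still reserved, which is exactly what closes the IOVA re-use hole David Woodhouse raised.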
* Re: [PATCH v8] dma-mapping: introduce dma_get_iommu_domain() 2017-10-08 3:45 ` [PATCH v8] dma-mapping: introduce dma_get_iommu_domain() Dan Williams [not found] ` <150743420333.12880.6968831423519457797.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-10 14:40 ` Raj, Ashok 1 sibling, 0 replies; 49+ messages in thread From: Raj, Ashok @ 2017-10-10 14:40 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Jan Kara, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, linux-mm, Jeff Moyer, linux-api, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski Hi Dan On Sat, Oct 07, 2017 at 08:45:00PM -0700, Dan Williams wrote: > Add a dma-mapping api helper to retrieve the generic iommu_domain for a device. > The motivation for this interface is making RDMA transfers to DAX mappings > safe. If the DAX file's block map changes we need to be to reliably stop > accesses to blocks that have been freed or re-assigned to a new file. With the > iommu_domain and a callback from the DAX filesystem the kernel can safely > revoke access to a DMA device. The process that performed the RDMA memory > registration is also notified of this revocation event, but the kernel can not > otherwise be in the position of waiting for userspace to quiesce the device. > > Since PMEM+DAX is currently only enabled for x86, we only update the x86 > iommu drivers. > > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > Cc: Robin Murphy <robin.murphy@arm.com> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Joerg Roedel <joro@8bytes.org> > Cc: David Woodhouse <dwmw2@infradead.org> > Cc: Ashok Raj <ashok.raj@intel.com> > Cc: Jan Kara <jack@suse.cz> > Cc: Jeff Moyer <jmoyer@redhat.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Dave Chinner <david@fromorbit.com> > Cc: "Darrick J. 
Wong" <darrick.wong@oracle.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > Changes since v7: > * retrieve the iommu_domain so that we can later pass the results of > dma_map_* to iommu_unmap() in advance of the actual dma_unmap_*. > > drivers/base/dma-mapping.c | 10 ++++++++++ > drivers/iommu/amd_iommu.c | 10 ++++++++++ > drivers/iommu/intel-iommu.c | 15 +++++++++++++++ > include/linux/dma-mapping.h | 3 +++ > 4 files changed, 38 insertions(+) > > diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c > index e584eddef0a7..fdb9764f95a4 100644 > --- a/drivers/base/dma-mapping.c > +++ b/drivers/base/dma-mapping.c > @@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev) > of_dma_deconfigure(dev); > acpi_dma_deconfigure(dev); > } > + > +struct iommu_domain *dma_get_iommu_domain(struct device *dev) > +{ > + const struct dma_map_ops *ops = get_dma_ops(dev); > + > + if (ops && ops->get_iommu) > + return ops->get_iommu(dev); > + return NULL; > +} > +EXPORT_SYMBOL(dma_get_iommu_domain); > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c > index 51f8215877f5..c8e1a45af182 100644 > --- a/drivers/iommu/amd_iommu.c > +++ b/drivers/iommu/amd_iommu.c > @@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev) > return domain; > } > > +static struct iommu_domain *amd_dma_get_iommu(struct device *dev) Minor: Do you want to keep the naming consistent.. amd_dma_get_domain() vs get_iommu? 
> +{ > + struct protection_domain *domain = get_domain(dev); > + > + if (IS_ERR(domain)) > + return NULL; > + return &domain->domain; > +} > + > static void update_device_table(struct protection_domain *domain) > { > struct iommu_dev_data *dev_data; > @@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = { > .unmap_sg = unmap_sg, > .dma_supported = amd_iommu_dma_supported, > .mapping_error = amd_iommu_mapping_error, > + .get_iommu = amd_dma_get_iommu, ditto for here and other places below: > }; > > static int init_reserved_iova_ranges(void) > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c > index 6784a05dd6b2..f3f4939cebad 100644 > --- a/drivers/iommu/intel-iommu.c > +++ b/drivers/iommu/intel-iommu.c > @@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev) > return 0; > } > > +static struct iommu_domain *intel_dma_get_iommu(struct device *dev) > +{ > + struct dmar_domain *domain; > + > + if (iommu_no_mapping(dev)) > + return NULL; > + > + domain = get_valid_domain_for_dev(dev); > + if (!domain) > + return NULL; > + > + return &domain->domain; > +} > + > static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr, > size_t size, int dir, u64 dma_mask) > { > @@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = { > .map_page = intel_map_page, > .unmap_page = intel_unmap_page, > .mapping_error = intel_mapping_error, > + .get_iommu = intel_dma_get_iommu, > #ifdef CONFIG_X86 > .dma_supported = x86_dma_supported, > #endif > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h > index 29ce9815da87..aa62df1d0d72 100644 > --- a/include/linux/dma-mapping.h > +++ b/include/linux/dma-mapping.h > @@ -128,6 +128,7 @@ struct dma_map_ops { > enum dma_data_direction dir); > int (*mapping_error)(struct device *dev, dma_addr_t dma_addr); > int (*dma_supported)(struct device *dev, u64 mask); > + struct iommu_domain *(*get_iommu)(struct device *dev); > #ifdef 
ARCH_HAS_DMA_GET_REQUIRED_MASK > u64 (*get_required_mask)(struct device *dev); > #endif > @@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev) > } > #endif > > +extern struct iommu_domain *dma_get_iommu_domain(struct device *dev); > + > static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, > size_t size, > enum dma_data_direction dir, > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <150732935473.22363.1853399637339625023.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-09 18:58 ` Jason Gunthorpe 2017-10-09 19:05 ` Dan Williams 0 siblings, 1 reply; 49+ messages in thread From: Jason Gunthorpe @ 2017-10-09 18:58 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner, Robin Murphy, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, David Woodhouse, Christoph Hellwig, Marek Szyprowski On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote: > otherwise be quiesced. The need for this knowledge is driven by a need > to make RDMA transfers to DAX mappings safe. If the DAX file's block map > changes we need to be to reliably stop accesses to blocks that have been > freed or re-assigned to a new file. If RDMA is driving this need, why not invalidate backing RDMA MRs instead of requiring a IOMMU to do it? RDMA MR are finer grained and do not suffer from the re-use problem David W. brought up with IOVAs.. Jason ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-09 18:58 ` [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() Jason Gunthorpe @ 2017-10-09 19:05 ` Dan Williams [not found] ` <CAPcyv4gXzC8OUgO_PciQ2phyq0YtmXjMGWvoPSVVuuZR7ohVCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-09 19:05 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-nvdimm@lists.01.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote: >> otherwise be quiesced. The need for this knowledge is driven by a need >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map >> changes we need to be to reliably stop accesses to blocks that have been >> freed or re-assigned to a new file. > > If RDMA is driving this need, why not invalidate backing RDMA MRs > instead of requiring a IOMMU to do it? RDMA MR are finer grained and > do not suffer from the re-use problem David W. brought up with IOVAs.. Sounds promising. All I want in the end is to be sure that the kernel is enabled to stop any in-flight RDMA at will without asking userspace. Does this require per-RDMA driver opt-in or is there a common call that can be made? Outside of that the re-use problem is already solved by just unmapping (iommu_unmap()) the IOVA, but keeping it allocated until the eventual dma_unmap_sg() at memory un-registration time frees it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <CAPcyv4gXzC8OUgO_PciQ2phyq0YtmXjMGWvoPSVVuuZR7ohVCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-10-09 19:18 ` Jason Gunthorpe [not found] ` <20171009191820.GD15336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Jason Gunthorpe @ 2017-10-09 19:18 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Dave Chinner, Robin Murphy, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Linux API, linux-fsdevel, David Woodhouse, Christoph Hellwig, Marek Szyprowski On Mon, Oct 09, 2017 at 12:05:30PM -0700, Dan Williams wrote: > On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe > <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote: > >> otherwise be quiesced. The need for this knowledge is driven by a need > >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map > >> changes we need to be to reliably stop accesses to blocks that have been > >> freed or re-assigned to a new file. > > > > If RDMA is driving this need, why not invalidate backing RDMA MRs > > instead of requiring a IOMMU to do it? RDMA MR are finer grained and > > do not suffer from the re-use problem David W. brought up with IOVAs.. > > Sounds promising. All I want in the end is to be sure that the kernel > is enabled to stop any in-flight RDMA at will without asking > userspace. Does this require per-RDMA driver opt-in or is there a > common call that can be made? I don't think this has ever come up in the context of an all-device MR invalidate requirement. Drivers already have code to invalidate specifc MRs, but to find all MRs that touch certain pages and then invalidate them would be new code. 
We also have ODP aware drivers that can retarget a MR to new physical pages. If the block map changes DAX should synchronously retarget the ODP MR, not halt DMA. Most likely ODP & DAX would need to be used together to get robust user applications, as having the user QP's go to an error state at random times (due to DMA failures) during operation is never going to be acceptable... Perhaps you might want to initially only support ODP MR mappings with DAX and then the DMA fencing issue goes away? Cheers, Jason
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <20171009191820.GD15336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2017-10-09 19:28 ` Dan Williams [not found] ` <CAPcyv4h_uQGBAX6-bMkkZLO_YyQ6t4n_b8tH8wU_P0Jh23N5MQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-10-13 7:09 ` Christoph Hellwig 1 sibling, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-09 19:28 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Mon, Oct 9, 2017 at 12:18 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Mon, Oct 09, 2017 at 12:05:30PM -0700, Dan Williams wrote: >> On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe >> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: >> > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote: >> >> otherwise be quiesced. The need for this knowledge is driven by a need >> >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map >> >> changes we need to be to reliably stop accesses to blocks that have been >> >> freed or re-assigned to a new file. >> > >> > If RDMA is driving this need, why not invalidate backing RDMA MRs >> > instead of requiring a IOMMU to do it? RDMA MR are finer grained and >> > do not suffer from the re-use problem David W. brought up with IOVAs.. >> >> Sounds promising. All I want in the end is to be sure that the kernel >> is enabled to stop any in-flight RDMA at will without asking >> userspace. Does this require per-RDMA driver opt-in or is there a >> common call that can be made? 
> > I don't think this has ever come up in the context of an all-device MR > invalidate requirement. Drivers already have code to invalidate > specifc MRs, but to find all MRs that touch certain pages and then > invalidate them would be new code. > > We also have ODP aware drivers that can retarget a MR to new > physical pages. If the block map changes DAX should synchronously > retarget the ODP MR, not halt DMA. Have a look at the patch [1], I don't touch the ODP path. > Most likely ODP & DAX would need to be used together to get robust > user applications, as having the user QP's go to an error state at > random times (due to DMA failures) during operation is never going to > be acceptable... It's not random. The process that set up the mapping and registered the memory gets SIGIO when someone else tries to modify the file map. That process then gets /proc/sys/fs/lease-break-time seconds to fix the problem before the kernel force revokes the DMA access. It's otherwise not acceptable to allow DMA into random locations when the file map changes. > Perhaps you might want to initially only support ODP MR mappings with > DAX and then the DMA fencing issue goes away? I'd rather try to fix the non-ODP DAX case instead of just turning it off. [1]: https://patchwork.kernel.org/patch/9991681/ -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <CAPcyv4h_uQGBAX6-bMkkZLO_YyQ6t4n_b8tH8wU_P0Jh23N5MQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-10-10 17:25 ` Jason Gunthorpe [not found] ` <20171010172516.GA29915-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Jason Gunthorpe @ 2017-10-10 17:25 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Dave Chinner, Robin Murphy, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Linux API, linux-fsdevel, David Woodhouse, Christoph Hellwig, Marek Szyprowski On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote: > > I don't think this has ever come up in the context of an all-device MR > > invalidate requirement. Drivers already have code to invalidate > > specifc MRs, but to find all MRs that touch certain pages and then > > invalidate them would be new code. > > > > We also have ODP aware drivers that can retarget a MR to new > > physical pages. If the block map changes DAX should synchronously > > retarget the ODP MR, not halt DMA. > > Have a look at the patch [1], I don't touch the ODP path. But, does ODP work OK already? I'm not clear on that.. > > Most likely ODP & DAX would need to be used together to get robust > > user applications, as having the user QP's go to an error state at > > random times (due to DMA failures) during operation is never going to > > be acceptable... > > It's not random. The process that set up the mapping and registered > the memory gets SIGIO when someone else tries to modify the file map. > That process then gets /proc/sys/fs/lease-break-time seconds to fix > the problem before the kernel force revokes the DMA access. Well, the process can't fix the problem in bounded time, so it is random if it will fail or not. 
MR life time is under the control of the remote side, and time to complete the network exchanges required to release the MRs is hard to bound. So even if I implement SIGIO properly my app will still likely have random QP failures under various cases and work loads. :( This is why ODP should be the focus because this cannot work fully reliably otherwise.. > > Perhaps you might want to initially only support ODP MR mappings with > > DAX and then the DMA fencing issue goes away? > > I'd rather try to fix the non-ODP DAX case instead of just turning it off. Well, what about using SIGKILL if the lease-break-time hits? The kernel will clean up the MRs when the process exits and this will fence DMA to that memory. But, still, if you really want to be fine-grained, then I think invalidating the impacted MR's is a better solution for RDMA than trying to do it with the IOMMU... Jason
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <20171010172516.GA29915-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2017-10-10 17:39 ` Dan Williams [not found] ` <CAPcyv4jL5fN7jjXkQum8ERQ45eW63dCYp5Pm6aHY4OPudz4Wsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-10 17:39 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Dave Chinner, Robin Murphy, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Linux API, linux-fsdevel, David Woodhouse, Christoph Hellwig, Marek Szyprowski On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote: > >> > I don't think this has ever come up in the context of an all-device MR >> > invalidate requirement. Drivers already have code to invalidate >> > specifc MRs, but to find all MRs that touch certain pages and then >> > invalidate them would be new code. >> > >> > We also have ODP aware drivers that can retarget a MR to new >> > physical pages. If the block map changes DAX should synchronously >> > retarget the ODP MR, not halt DMA. >> >> Have a look at the patch [1], I don't touch the ODP path. > > But, does ODP work OK already? I'm not clear on that.. It had better. If the mapping is invalidated I would hope that generates an io fault that gets handled by the driver to setup the new mapping. I don't see how it can work otherwise. >> > Most likely ODP & DAX would need to be used together to get robust >> > user applications, as having the user QP's go to an error state at >> > random times (due to DMA failures) during operation is never going to >> > be acceptable... >> >> It's not random. 
The process that set up the mapping and registered >> the memory gets SIGIO when someone else tries to modify the file map. >> That process then gets /proc/sys/fs/lease-break-time seconds to fix >> the problem before the kernel force revokes the DMA access. > > Well, the process can't fix the problem in bounded time, so it is > random if it will fail or not. > > MR life time is under the control of the remote side, and time to > complete the network exchanges required to release the MRs is hard to > bound. So even if I implement SIGIO properly my app will still likely > have random QP failures under various cases and work loads. :( > > This is why ODP should be the focus because this cannot work fully > reliably otherwise.. The lease break time is configurable. If that application can't respond to a stop request within a timeout of its own choosing then it should not be using DAX mappings. > >> > Perhaps you might want to initially only support ODP MR mappings with >> > DAX and then the DMA fencing issue goes away? >> >> I'd rather try to fix the non-ODP DAX case instead of just turning it off. > > Well, what about using SIGKILL if the lease-break-time hits? The > kernel will clean up the MRs when the process exits and this will > fence DMA to that memory. Can you point me to where the MR cleanup code fences DMA and quiesces the device? > But, still, if you really want to be fined graned, then I think > invalidating the impacted MR's is a better solution for RDMA than > trying to do it with the IOMMU... If there's a better routine for handling ib_umem_lease_break() I'd love to use it. Right now I'm reaching for the only tool I know for kernel enforced revocation of DMA access. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <CAPcyv4jL5fN7jjXkQum8ERQ45eW63dCYp5Pm6aHY4OPudz4Wsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-10-10 18:05 ` Jason Gunthorpe 2017-10-10 20:17 ` Dan Williams 0 siblings, 1 reply; 49+ messages in thread From: Jason Gunthorpe @ 2017-10-10 18:05 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Dave Chinner, Robin Murphy, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Linux API, linux-fsdevel, David Woodhouse, Christoph Hellwig, Marek Szyprowski On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote: > On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe > >> Have a look at the patch [1], I don't touch the ODP path. > > > > But, does ODP work OK already? I'm not clear on that.. > > It had better. If the mapping is invalidated I would hope that > generates an io fault that gets handled by the driver to setup the new > mapping. I don't see how it can work otherwise. I would assume so too... > > This is why ODP should be the focus because this cannot work fully > > reliably otherwise.. > > The lease break time is configurable. If that application can't > respond to a stop request within a timeout of its own choosing then it > should not be using DAX mappings. Well, no RDMA application can really do this, unless you set the timeout to multiple minutes, on par with network timeouts. Again, these details are why I think this kind of DAX and non ODP-MRs are probably practically not too useful for a production system. Great for test of course, but in that case SIGKILL would be fine too... > > Well, what about using SIGKILL if the lease-break-time hits? The > > kernel will clean up the MRs when the process exits and this will > > fence DMA to that memory. > > Can you point me to where the MR cleanup code fences DMA and quiesces > the device? Yes. 
The MR's are associated with an fd. When the fd is closed ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through all the objects, including MRs, and deletes them. The specification for deleting a MR requires a synchronous fence with the hardware. After MR deletion the hardware will not DMA to any pages described by the old MR, and those pages will be unpinned. > > But, still, if you really want to be fined graned, then I think > > invalidating the impacted MR's is a better solution for RDMA than > > trying to do it with the IOMMU... > > If there's a better routine for handling ib_umem_lease_break() I'd > love to use it. Right now I'm reaching for the only tool I know for > kernel enforced revocation of DMA access. Well, you'd have to code something in the MR code to keep track of DAX MRs and issue an out of band invalidate to impacted MRs to create the fence. This probably needs some driver work, I'm not sure if all the hardware can do out of band invalidate to any MR or not.. Generally speaking, in RDMA, when a new feature like this comes along we have to push a lot of the work down to the driver authors, and the approach has historically been that new features only work on some hardware (as much as I dislike this, it is pragmatic) So, not being able to support DAX on certain RDMA hardware is not an unreasonable situation in our space. Jason ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-10 18:05 ` Jason Gunthorpe @ 2017-10-10 20:17 ` Dan Williams 2017-10-12 18:27 ` Jason Gunthorpe 0 siblings, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-10 20:17 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-nvdimm@lists.01.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Tue, Oct 10, 2017 at 11:05 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote: >> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe > >> >> Have a look at the patch [1], I don't touch the ODP path. >> > >> > But, does ODP work OK already? I'm not clear on that.. >> >> It had better. If the mapping is invalidated I would hope that >> generates an io fault that gets handled by the driver to setup the new >> mapping. I don't see how it can work otherwise. > > I would assume so too... > >> > This is why ODP should be the focus because this cannot work fully >> > reliably otherwise.. >> >> The lease break time is configurable. If that application can't >> respond to a stop request within a timeout of its own choosing then it >> should not be using DAX mappings. > > Well, no RDMA application can really do this, unless you set the > timeout to multiple minutes, on par with network timeouts. The default lease break timeout is 45 seconds on my system, so minutes does not seem out of the question. Also keep in mind that what triggers the lease break is another application trying to write or punch holes in a file that is mapped for RDMA. 
So, if the hardware can't handle the iommu mapping getting invalidated asynchronously and the application can't react in the lease break timeout period then the administrator should arrange for the file to not be written or truncated while it is mapped. It's already the case that get_user_pages() does not lock down file associations, so if your application is contending with these types of file changes it likely already has a problem keeping transactions in sync with the file state even without DAX. > Again, these details are why I think this kind of DAX and non ODP-MRs > are probably practically not too useful for a production system. Great > for test of course, but in that case SIGKILL would be fine too... > >> > Well, what about using SIGKILL if the lease-break-time hits? The >> > kernel will clean up the MRs when the process exits and this will >> > fence DMA to that memory. >> >> Can you point me to where the MR cleanup code fences DMA and quiesces >> the device? > > Yes. The MR's are associated with an fd. When the fd is closed > ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through > all the objects, including MRs, and deletes them. > > The specification for deleting a MR requires a synchronous fence with > the hardware. After MR deletion the hardware will not DMA to any pages > described by the old MR, and those pages will be unpinned. > >> > But, still, if you really want to be fined graned, then I think >> > invalidating the impacted MR's is a better solution for RDMA than >> > trying to do it with the IOMMU... >> >> If there's a better routine for handling ib_umem_lease_break() I'd >> love to use it. Right now I'm reaching for the only tool I know for >> kernel enforced revocation of DMA access. > > Well, you'd have to code something in the MR code to keep track of DAX > MRs and issue an out of band invalidate to impacted MRs to create the > fence. 
> > This probably needs some driver work, I'm not sure if all the hardware > can do out of band invalidate to any MR or not.. Ok. > > Generally speaking, in RDMA, when a new feature like this comes along > we have to push a lot of the work down to the driver authors, and the > approach has historically been that new features only work on some > hardware (as much as I dislike this, it is pragmatic) > > So, not being able to support DAX on certain RDMA hardware is not > an unreasonable situation in our space. That makes sense, but it still seems to me that this proposed solution allows more than enough ways to avoid that worst case scenario where hardware reacts badly to iommu invalidation. Drivers that can do better than iommu invalidation can arrange for a callback to do their driver-specific action at lease break time. Hardware that can't should be blacklisted from supporting DAX altogether. In other words this is a starting point to incrementally enhance or disable specific drivers, but with the assurance that the kernel can always do the safe thing when / if the driver is missing a finer grained solution. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-10 20:17 ` Dan Williams @ 2017-10-12 18:27 ` Jason Gunthorpe 2017-10-12 20:10 ` Dan Williams 0 siblings, 1 reply; 49+ messages in thread From: Jason Gunthorpe @ 2017-10-12 18:27 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm@lists.01.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote: > Also keep in mind that what triggers the lease break is another > application trying to write or punch holes in a file that is mapped > for RDMA. So, if the hardware can't handle the iommu mapping getting > invalidated asynchronously and the application can't react in the > lease break timeout period then the administrator should arrange for > the file to not be written or truncated while it is mapped. That makes sense, but why not return ENOSYS or something to the app trying to alter the file if the RDMA hardware can't support this instead of having the RDMA app deal with this lease break weirdness? > It's already the case that get_user_pages() does not lock down file > associations, so if your application is contending with these types of > file changes it likely already has a problem keeping transactions in > sync with the file state even without DAX. Yes, things go weird in non-ODP RDMA cases like this.. Also, just to be clear, I would expect an app using the SIGIO interface to basically halt ongoing RDMA, wait for MRs to become unused locally and remotely, destroy the MRs, then, somehow, establish new MRs that cover the same logical map (e.g. what ODP would do transparently) after the lease breaker has made their changes, then restart their IO. Does your SIGIO approach have a race-free way to do those last steps?
> > So, not being able to support DAX on certain RDMA hardware is not > > an unreasonable situation in our space. > > That makes sense, but it still seems to me that this proposed solution > allows more than enough ways to avoid that worst case scenario where > hardware reacts badly to iommu invalidation. Yes, although I am concerned that returning PCI-E errors is such an unusual and untested path for some of our RDMA drivers that they may malfunction badly... Again, going back to the question of who would ever use this, I would be very reluctant to deploy a production configuration relying on the iommu invalidate or SIGIO techniques, when ODP HW is available and works flawlessly. > be blacklisted from supporting DAX altogether. In other words this is > a starting point to incrementally enhance or disable specific drivers, > but with the assurance that the kernel can always do the safe thing > when / if the driver is missing a finer grained solution. Seems reasonable.. I think existing HW will have an easier time adding invalidate, while new hardware really should implement ODP. Jason ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-12 18:27 ` Jason Gunthorpe @ 2017-10-12 20:10 ` Dan Williams 2017-10-13 6:50 ` Christoph Hellwig 0 siblings, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-12 20:10 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-nvdimm@lists.01.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Thu, Oct 12, 2017 at 11:27 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote: > >> Also keep in mind that what triggers the lease break is another >> application trying to write or punch holes in a file that is mapped >> for RDMA. So, if the hardware can't handle the iommu mapping getting >> invalidated asynchronously and the application can't react in the >> lease break timeout period then the administrator should arrange for >> the file to not be written or truncated while it is mapped. > > That makes sense, but why not return ENOSYS or something to the app > trying to alter the file if the RDMA hardware can't support this > instead of having the RDMA app deal with this lease break weirdness? That's where I started, an inode flag that said "hands off, this file is busy", but Christoph pointed out that we should reuse the same mechanisms that pnfs is using. The pnfs protection scheme uses file leases, and once the kernel decides that a lease needs to be broken / layout needs to be recalled there is no stopping it, only delaying. >> It's already the case that get_user_pages() does not lock down file >> associations, so if your application is contending with these types of >> file changes it likely already has a problem keeping transactions in >> sync with the file state even without DAX. 
> > Yes, things go weird in non-ODP RDMA cases like this.. > > Also, just to clear, I would expect an app using the SIGIO interface > to basically halt ongoing RDMA, wait for MRs to become unused locally > and remotely, destroy the MRs, then somehow, establish new MRs that > cover the same logical map (eg what ODP would do transparently) after > the lease breaker has made their changes, then restart their IO. > > Does your SIGIO approach have a race-free way to do that last steps? After the SIGIO that becomes a userspace / driver problem to quiesce the I/O... However, chatting this over with a few more people I have an alternate solution that effectively behaves the same as how non-ODP hardware handles this case of hole punch / truncation today. So, today if this scenario happens on a page-cache backed mapping, the file blocks are unmapped and the RDMA continues into pinned pages that are no longer part of the file. We can achieve the same thing with the iommu, just re-target the I/O into memory that isn't part of the file. That way hardware does not see I/O errors and the DAX data consistency model is no worse than the page-cache case. >> > So, not being able to support DAX on certain RDMA hardware is not >> > an unreasonable situation in our space. >> >> That makes sense, but it still seems to me that this proposed solution >> allows more than enough ways to avoid that worst case scenario where >> hardware reacts badly to iommu invalidation. > > Yes, although I am concerned that returning PCI-E errors is such an > unusual and untested path for some of our RDMA drivers that they may > malfunction badly... > > Again, going back to the question of who would ever use this, I would > be very relucant to deploy a production configuration relying on the iommu > invalidate or SIGIO techniques, when ODP HW is available and works > flawlessly.
I don't think it is reasonable to tell people you need to throw away your old hardware just because you want to target a DAX mapping. >> be blacklisted from supporting DAX altogether. In other words this is >> a starting point to incrementally enhance or disable specific drivers, >> but with the assurance that the kernel can always do the safe thing >> when / if the driver is missing a finer grained solution. > > Seems reasonable.. I think existing HW will have an easier time adding > invalidate, while new hardware really should implement ODP. > Yeah, so if we go with 'remap' instead of 'invalidate' does that address your concerns? ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() 2017-10-12 20:10 ` Dan Williams @ 2017-10-13 6:50 ` Christoph Hellwig [not found] ` <20171013065047.GA26461-jcswGhMUV9g@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Christoph Hellwig @ 2017-10-13 6:50 UTC (permalink / raw) To: Dan Williams Cc: Jason Gunthorpe, linux-nvdimm@lists.01.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski On Thu, Oct 12, 2017 at 01:10:33PM -0700, Dan Williams wrote: > On Thu, Oct 12, 2017 at 11:27 AM, Jason Gunthorpe > <jgunthorpe@obsidianresearch.com> wrote: > > On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote: > > > >> Also keep in mind that what triggers the lease break is another > >> application trying to write or punch holes in a file that is mapped > >> for RDMA. So, if the hardware can't handle the iommu mapping getting > >> invalidated asynchronously and the application can't react in the > >> lease break timeout period then the administrator should arrange for > >> the file to not be written or truncated while it is mapped. > > > > That makes sense, but why not return ENOSYS or something to the app > > trying to alter the file if the RDMA hardware can't support this > > instead of having the RDMA app deal with this lease break weirdness? > > That's where I started, an inode flag that said "hands off, this file > is busy", but Christoph pointed out that we should reuse the same > mechanisms that pnfs is using. The pnfs protection scheme uses file > leases, and once the kernel decides that a lease needs to be broken / > layout needs to be recalled there is no stopping it, only delaying. That was just a suggestion - the important statement is that a hands off flag is just a no-go. 
> However, chatting this over with a few more people I have an alternate > solution that effectively behaves the same as how non-ODP hardware > handles this case of hole punch / truncation today. So, today if this > scenario happens on a page-cache backed mapping, the file blocks are > unmapped and the RDMA continues into pinned pages that are no longer > part of the file. We can achieve the same thing with the iommu, just > re-target the I/O into memory that isn't part of the file. That way > hardware does not see I/O errors and the DAX data consistency model is > no worse than the page-cache case. Yikes. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <20171013065047.GA26461-jcswGhMUV9g@public.gmane.org> @ 2017-10-13 15:03 ` Jason Gunthorpe [not found] ` <20171013150348.GA11257-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Jason Gunthorpe @ 2017-10-13 15:03 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Dave Chinner, Robin Murphy, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Linux API, linux-fsdevel, David Woodhouse, Marek Szyprowski On Fri, Oct 13, 2017 at 08:50:47AM +0200, Christoph Hellwig wrote: > > However, chatting this over with a few more people I have an alternate > > solution that effectively behaves the same as how non-ODP hardware > > handles this case of hole punch / truncation today. So, today if this > > scenario happens on a page-cache backed mapping, the file blocks are > > unmapped and the RDMA continues into pinned pages that are no longer > > part of the file. We can achieve the same thing with the iommu, just > > re-target the I/O into memory that isn't part of the file. That way > > hardware does not see I/O errors and the DAX data consistency model is > > no worse than the page-cache case. > > Yikes. Well, as much as you say Yikes, Dan is correct, this does match the semantics RDMA MRs already have. They become non-coherent if their underlying object is changed, and there are many ways to get there. I've never thought about it, but it does sound like ftruncate, fallocate, etc on a normal file would break the MR coherency too?? There have been efforts in the past driven by the MPI people to create, essentially, something like a 'lease-break' SIGIO. Except it was intended to be general, and wanted to solve all the problems related to MR de-coherence.
This was complicated and never became acceptable to mainline. Instead ODP was developed, and ODP actually solves all the problems sanely. Thinking about it some more, and with your other comments on get_user_pages in this thread, I tend to agree. It doesn't make sense to develop a user space lease break API for MRs that is a DAX specific feature. Along the same lines, it also doesn't make sense to force-invalidate MRs linked to DAX regions, while leaving MRs linked to other regions that have the same problem alone. If you want to make non-ODP MRs work better, then you need to have a general overall solution to tell userspace when the MR becomes (or I guess, is becoming) non-coherent, that covers all the cases that break MR coherence, not just via DAX. Otherwise, I think Dan is right, keeping the current semantic of having MRs just do something wrong, but not corrupt memory, when they lose coherence, is broadly consistent with how non-ODP MRs work today. Jason ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <20171013150348.GA11257-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2017-10-15 15:14 ` Matan Barak [not found] ` <CAAKD3BBR2CmQvg-3bqPog0VFrEm=QU-b-xBDH-_Q+sXV9NkFUA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Matan Barak @ 2017-10-15 15:14 UTC (permalink / raw) To: Jason Gunthorpe Cc: Christoph Hellwig, Dan Williams, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Marek Szyprowski, Liran Liss On Fri, Oct 13, 2017 at 6:03 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Fri, Oct 13, 2017 at 08:50:47AM +0200, Christoph Hellwig wrote: > >> > However, chatting this over with a few more people I have an alternate >> > solution that effectively behaves the same as how non-ODP hardware >> > handles this case of hole punch / truncation today. So, today if this >> > scenario happens on a page-cache backed mapping, the file blocks are >> > unmapped and the RDMA continues into pinned pages that are no longer >> > part of the file. We can achieve the same thing with the iommu, just >> > re-target the I/O into memory that isn't part of the file. That way >> > hardware does not see I/O errors and the DAX data consistency model is >> > no worse than the page-cache case. >> >> Yikes. > > Well, as much as you say Yikes, Dan is correct, this does match the > semantics RDMA MR's already have. They become non-coherent if their > underlying object is changed, and there are many ways to get there. > I've never thought about it, but it does sound like ftruncate, > fallocate, etc on a normal file would break the MR coherency too?? 
> > There have been efforts in the past driven by the MPI people to > create, essentially, something like lease-break' SIGIO. Except it was > intended to be general, and wanted solve all the problems related with > MR de-coherence. This was complicated and never became acceptable to > mainline. > > Instead ODP was developed, and ODP actually solves all the problem > sanely. > > Thinking about it some more, and with your other comments on > get_user_pages in this thread, I tend to agree. It doesn't make sense > to develop a user space lease break API for MR's that is a DAX > specific feature. > > Along the some lines, it also doesn't make sense to force-invalidate > MR's linked to DAX regions, while leaving MR's linked to other > regions that have the same problem alone. > > If you want to make non-ODP MR's work better, then you need to have a > general overall solution to tell userspace when the MR becomes (or I > guess, is becoming) non-coherent, that covers all the cases that break > MR coherence, not just via DAX. > > Otherwise, I think Dan is right, keeping the current semantic of > having MRs just do something wrong, but not corrupt memory, when they > loose coherence, is broadly consistent with how non-ODP MRs work today. > I agree, keeping the current semantics is probably the best thing we could do. It's a trade-off between breaking existing applications, having a new lease API for DAX or just failing DAX in particular (as opposed to other cases). For stable mappings, what we have is probably sufficient. For mappings which could be changed, it's unclear to me how you could guarantee non-racy behavior that is bounded by a pre-defined time and guarantee no user-space errors. On top of that, ODP (should) already solve that problem transparently. IMHO, using iommu for that and causing DMA errors just because the lease broke isn't the right thing to do. 
> Jason Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <CAAKD3BBR2CmQvg-3bqPog0VFrEm=QU-b-xBDH-_Q+sXV9NkFUA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-10-15 15:21 ` Dan Williams 0 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-15 15:21 UTC (permalink / raw) To: Matan Barak Cc: Jason Gunthorpe, Christoph Hellwig, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Marek Szyprowski, Liran Liss On Sun, Oct 15, 2017 at 8:14 AM, Matan Barak <matanb-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote: [..] > IMHO, using iommu for that and causing DMA errors just because the > lease broke isn't the right thing to do. Yes, see the current proposal over in this thread: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012885.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() [not found] ` <20171009191820.GD15336-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2017-10-09 19:28 ` Dan Williams @ 2017-10-13 7:09 ` Christoph Hellwig 1 sibling, 0 replies; 49+ messages in thread From: Christoph Hellwig @ 2017-10-13 7:09 UTC (permalink / raw) To: Jason Gunthorpe Cc: Dan Williams, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig, Marek Szyprowski

On Mon, Oct 09, 2017 at 01:18:20PM -0600, Jason Gunthorpe wrote:
> > > If RDMA is driving this need, why not invalidate backing RDMA MRs
> > > instead of requiring an IOMMU to do it? RDMA MRs are finer grained and
> > > do not suffer from the re-use problem David W. brought up with IOVAs..
> >
> > Sounds promising. All I want in the end is to be sure that the kernel
> > is enabled to stop any in-flight RDMA at will without asking
> > userspace. Does this require per-RDMA driver opt-in or is there a
> > common call that can be made?
>
> I don't think this has ever come up in the context of an all-device MR
> invalidate requirement. Drivers already have code to invalidate
> specific MRs, but to find all MRs that touch certain pages and then
> invalidate them would be new code.

The whole point is that we should not need that IFF we provide the
right interface. If we have a new 'register memory with a lease', the
driver (or in fact probably the umem core for the drivers using it)
has the lease associated with the ib_umem structure, which will just
need a backpointer from the ib_umem to the MR to unregister it.

Which might be a good opportunity to break the user MR from the
in-kernel ones and merge it with ib_umem, but that's a different
story..
^ permalink raw reply	[flat|nested] 49+ messages in thread
[parent not found: <150732931273.22363.8436792888326501071.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>]
* [PATCH v7 01/12] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags [not found] ` <150732931273.22363.8436792888326501071.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-06 22:35 ` Dan Williams 2017-10-06 22:35 ` [PATCH v7 02/12] fs, mm: pass fd to ->mmap_validate() Dan Williams ` (3 subsequent siblings) 4 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw) To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw Cc: Jan Kara, Arnd Bergmann, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds, Christoph Hellwig The mmap(2) syscall suffers from the ABI anti-pattern of not validating unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a mechanism to define new behavior that is known to fail on older kernels without the support. Define a new MAP_SHARED_VALIDATE flag pattern that is guaranteed to fail on all legacy mmap implementations. It is worth noting that the original proposal was for a standalone MAP_VALIDATE flag. However, when that could not be supported by all archs Linus observed: I see why you *think* you want a bitmap. You think you want a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC etc, so that people can do ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_SYNC, fd, 0); and "know" that MAP_SYNC actually takes. And I'm saying that whole wish is bogus. You're fundamentally depending on special semantics, just make it explicit. It's already not portable, so don't try to make it so. 
Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value of 0x3, and make people do ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0); and then the kernel side is easier too (none of that random garbage playing games with looking at the "MAP_VALIDATE bit", but just another case statement in that map type thing. Boom. Done. Similar to ->fallocate() we also want the ability to validate the support for new flags on a per ->mmap() 'struct file_operations' instance basis. Towards that end arrange for flags to be generically validated against a mmap_supported_mask exported by 'struct file_operations'. By default all existing flags are implicitly supported, but new flags require MAP_SHARED_VALIDATE and per-instance-opt-in. Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> Cc: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> Cc: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Suggested-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Suggested-by: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- arch/alpha/include/uapi/asm/mman.h | 1 + arch/mips/include/uapi/asm/mman.h | 1 + arch/mips/kernel/vdso.c | 2 + arch/parisc/include/uapi/asm/mman.h | 1 + arch/tile/mm/elf.c | 3 +- arch/xtensa/include/uapi/asm/mman.h | 1 + include/linux/fs.h | 2 + include/linux/mm.h | 2 + include/linux/mman.h | 39 ++++++++++++++++++++++++++ include/uapi/asm-generic/mman-common.h | 1 + mm/mmap.c | 21 ++++++++++++-- tools/include/uapi/asm-generic/mman-common.h | 1 + 12 files changed, 69 insertions(+), 6 deletions(-) diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index 3b26cc62dadb..92823f24890b 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -14,6 +14,7 @@ #define 
MAP_TYPE 0x0f /* Mask for type of mapping (OSF/1 is _wrong_) */ #define MAP_FIXED 0x100 /* Interpret addr exactly */ #define MAP_ANONYMOUS 0x10 /* don't use a file */ +#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */ /* not used by linux, but here to make sure we don't clash with OSF/1 defines */ #define _MAP_HASSEMAPHORE 0x0200 diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index da3216007fe0..c77689076577 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -30,6 +30,7 @@ #define MAP_PRIVATE 0x002 /* Changes are private */ #define MAP_TYPE 0x00f /* Mask for type of mapping */ #define MAP_FIXED 0x010 /* Interpret addr exactly */ +#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */ /* not used by linux, but here to make sure we don't clash with ABI defines */ #define MAP_RENAME 0x020 /* Assign page to file */ diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index 019035d7225c..cf10654477a9 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ|VM_WRITE|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, - 0, NULL); + 0, NULL, 0); if (IS_ERR_VALUE(base)) { ret = base; goto out; diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index 775b5d5e41a1..36b688d52de3 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -14,6 +14,7 @@ #define MAP_TYPE 0x03 /* Mask for type of mapping */ #define MAP_FIXED 0x04 /* Interpret addr exactly */ #define MAP_ANONYMOUS 0x10 /* don't use a file */ +#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */ diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c index 
889901824400..5ffcbe76aef9 100644 --- a/arch/tile/mm/elf.c +++ b/arch/tile/mm/elf.c @@ -143,7 +143,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, unsigned long addr = MEM_USER_INTRPT; addr = mmap_region(NULL, addr, INTRPT_SIZE, VM_READ|VM_EXEC| - VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, NULL); + VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, + NULL, 0); if (addr > (unsigned long) -PAGE_SIZE) retval = (int) addr; } diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h index b15b278aa314..ec597900eec7 100644 --- a/arch/xtensa/include/uapi/asm/mman.h +++ b/arch/xtensa/include/uapi/asm/mman.h @@ -37,6 +37,7 @@ #define MAP_PRIVATE 0x002 /* Changes are private */ #define MAP_TYPE 0x00f /* Mask for type of mapping */ #define MAP_FIXED 0x010 /* Interpret addr exactly */ +#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */ /* not used by linux, but here to make sure we don't clash with ABI defines */ #define MAP_RENAME 0x020 /* Assign page to file */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 339e73742e73..51538958f7f5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1701,6 +1701,8 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); + int (*mmap_validate) (struct file *, struct vm_area_struct *, + unsigned long); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); diff --git a/include/linux/mm.h b/include/linux/mm.h index f8c10d336e42..5c4c98e4adc9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - 
struct list_head *uf); + struct list_head *uf, unsigned long map_flags); extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, diff --git a/include/linux/mman.h b/include/linux/mman.h index c8367041fafd..94b63b4d71ff 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -7,6 +7,45 @@ #include <linux/atomic.h> #include <uapi/linux/mman.h> +/* + * Arrange for legacy / undefined architecture specific flags to be + * ignored by default in LEGACY_MAP_MASK. + */ +#ifndef MAP_32BIT +#define MAP_32BIT 0 +#endif +#ifndef MAP_HUGE_2MB +#define MAP_HUGE_2MB 0 +#endif +#ifndef MAP_HUGE_1GB +#define MAP_HUGE_1GB 0 +#endif +#ifndef MAP_UNINITIALIZED +#define MAP_UNINITIALIZED 0 +#endif + +/* + * The historical set of flags that all mmap implementations implicitly + * support when a ->mmap_validate() op is not provided in file_operations. + */ +#define LEGACY_MAP_MASK (MAP_SHARED \ + | MAP_PRIVATE \ + | MAP_FIXED \ + | MAP_ANONYMOUS \ + | MAP_DENYWRITE \ + | MAP_EXECUTABLE \ + | MAP_UNINITIALIZED \ + | MAP_GROWSDOWN \ + | MAP_LOCKED \ + | MAP_NORESERVE \ + | MAP_POPULATE \ + | MAP_NONBLOCK \ + | MAP_STACK \ + | MAP_HUGETLB \ + | MAP_32BIT \ + | MAP_HUGE_2MB \ + | MAP_HUGE_1GB) + extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; extern unsigned long sysctl_overcommit_kbytes; diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 203268f9231e..ac55d1c0ec0f 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -24,6 +24,7 @@ #else # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */ #endif +#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */ /* * Flags for mlock diff --git a/mm/mmap.c b/mm/mmap.c index 680506faceae..a1bcaa9eff42 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1389,6 +1389,18 @@ 
unsigned long do_mmap(struct file *file, unsigned long addr, struct inode *inode = file_inode(file); switch (flags & MAP_TYPE) { + case (MAP_SHARED_VALIDATE): + if ((flags & ~LEGACY_MAP_MASK) == 0) { + /* + * If all legacy mmap flags, downgrade + * to MAP_SHARED, i.e. invoke ->mmap() + * instead of ->mmap_validate() + */ + flags &= ~MAP_TYPE; + flags |= MAP_SHARED; + } else if (!file->f_op->mmap_validate) + return -EOPNOTSUPP; + /* fall through */ case MAP_SHARED: if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE)) return -EACCES; @@ -1465,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; } - addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) @@ -1602,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags) unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf) + struct list_head *uf, unsigned long map_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; @@ -1687,7 +1699,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr, * new file must not have been exposed to user-space, yet. 
*/ vma->vm_file = get_file(file); - error = call_mmap(file, vma); + if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE) + error = file->f_op->mmap_validate(file, vma, map_flags); + else + error = call_mmap(file, vma); if (error) goto unmap_and_free_vma; diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h index 8c27db0c5c08..202bc4277fb5 100644 --- a/tools/include/uapi/asm-generic/mman-common.h +++ b/tools/include/uapi/asm-generic/mman-common.h @@ -24,6 +24,7 @@ #else # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */ #endif +#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */ /* * Flags for mlock ^ permalink raw reply related	[flat|nested] 49+ messages in thread
* [PATCH v7 02/12] fs, mm: pass fd to ->mmap_validate() [not found] ` <150732931273.22363.8436792888326501071.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2017-10-06 22:35 ` [PATCH v7 01/12] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags Dan Williams @ 2017-10-06 22:35 ` Dan Williams 2017-10-06 22:35 ` [PATCH v7 04/12] fs: MAP_DIRECT core Dan Williams ` (2 subsequent siblings) 4 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw) To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw Cc: Jan Kara, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Christoph Hellwig The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent block map changes while the file is mapped. It requires the fd to setup an fasync_struct for signalling lease break events to the lease holder. Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> Cc: "Darrick J. 
Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- arch/mips/kernel/vdso.c | 2 +- arch/tile/mm/elf.c | 2 +- arch/x86/mm/mpx.c | 3 ++- fs/aio.c | 2 +- include/linux/fs.h | 2 +- include/linux/mm.h | 9 +++++---- ipc/shm.c | 3 ++- mm/internal.h | 2 +- mm/mmap.c | 13 +++++++------ mm/nommu.c | 5 +++-- mm/util.c | 7 ++++--- 11 files changed, 28 insertions(+), 22 deletions(-) diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index cf10654477a9..ab26c7ac0316 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ|VM_WRITE|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, - 0, NULL, 0); + 0, NULL, 0, -1); if (IS_ERR_VALUE(base)) { ret = base; goto out; diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c index 5ffcbe76aef9..61a9588e141a 100644 --- a/arch/tile/mm/elf.c +++ b/arch/tile/mm/elf.c @@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, addr = mmap_region(NULL, addr, INTRPT_SIZE, VM_READ|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, - NULL, 0); + NULL, 0, -1); if (addr > (unsigned long) -PAGE_SIZE) retval = (int) addr; } diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c index 9ceaa955d2ba..a8baa94a496b 100644 --- a/arch/x86/mm/mpx.c +++ b/arch/x86/mm/mpx.c @@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len) down_write(&mm->mmap_sem); addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE, - MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL); + MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, + NULL, -1); up_write(&mm->mmap_sem); if (populate) mm_populate(addr, populate); diff --git a/fs/aio.c b/fs/aio.c index 
5a2487217072..d10ca6db2ee6 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size, PROT_READ | PROT_WRITE, - MAP_SHARED, 0, &unused, NULL); + MAP_SHARED, 0, &unused, NULL, -1); up_write(&mm->mmap_sem); if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; diff --git a/include/linux/fs.h b/include/linux/fs.h index 51538958f7f5..c2b9bf3dc4e9 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1702,7 +1702,7 @@ struct file_operations { long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*mmap_validate) (struct file *, struct vm_area_struct *, - unsigned long); + unsigned long, int); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); diff --git a/include/linux/mm.h b/include/linux/mm.h index 5c4c98e4adc9..0afa19feb755 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf, unsigned long map_flags); + struct list_head *uf, unsigned long map_flags, int fd); extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, - struct list_head *uf); + struct list_head *uf, int fd); extern int do_munmap(struct mm_struct *, unsigned long, size_t, struct list_head *uf); @@ -2145,9 +2145,10 @@ static inline unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long pgoff, unsigned long *populate, - 
struct list_head *uf) + struct list_head *uf, int fd) { - return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf); + return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, + uf, fd); } #ifdef CONFIG_MMU diff --git a/ipc/shm.c b/ipc/shm.c index 1e2b1692ba2c..585e05eef40a 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -1399,7 +1399,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, goto invalid; } - addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate, NULL); + addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate, + NULL, -1); *raddr = addr; err = 0; if (IS_ERR_VALUE(addr)) diff --git a/mm/internal.h b/mm/internal.h index 1df011f62480..70ed7b06dd85 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -466,7 +466,7 @@ extern u32 hwpoison_filter_enable; extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long, unsigned long, unsigned long, - unsigned long, unsigned long); + unsigned long, unsigned long, int); extern void set_pageblock_order(void); unsigned long reclaim_clean_pages_from_list(struct zone *zone, diff --git a/mm/mmap.c b/mm/mmap.c index a1bcaa9eff42..c2cb6334a7a9 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1322,7 +1322,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, - struct list_head *uf) + struct list_head *uf, int fd) { struct mm_struct *mm = current->mm; int pkey = 0; @@ -1477,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; } - addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags); + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags, fd); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) @@ -1527,7 +1527,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len, flags &= ~(MAP_EXECUTABLE | 
MAP_DENYWRITE); - retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff); + retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff, fd); out_fput: if (file) fput(file); @@ -1614,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags) unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf, unsigned long map_flags) + struct list_head *uf, unsigned long map_flags, int fd) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; @@ -1700,7 +1700,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, */ vma->vm_file = get_file(file); if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE) - error = file->f_op->mmap_validate(file, vma, map_flags); + error = file->f_op->mmap_validate(file, vma, + map_flags, fd); else error = call_mmap(file, vma); if (error) @@ -2842,7 +2843,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, file = get_file(vma->vm_file); ret = do_mmap_pgoff(vma->vm_file, start, size, - prot, flags, pgoff, &populate, NULL); + prot, flags, pgoff, &populate, NULL, -1); fput(file); out: up_write(&mm->mmap_sem); diff --git a/mm/nommu.c b/mm/nommu.c index 17c00d93de2e..952d205d3b66 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1206,7 +1206,8 @@ unsigned long do_mmap(struct file *file, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, - struct list_head *uf) + struct list_head *uf, + int fd) { struct vm_area_struct *vma; struct vm_region *region; @@ -1439,7 +1440,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len, flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE); - retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff); + retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff, fd); if (file) fput(file); diff --git a/mm/util.c b/mm/util.c index 34e57fae959d..dcf48d929185 100644 --- a/mm/util.c +++ b/mm/util.c @@ -319,7 +319,7 @@ 
EXPORT_SYMBOL_GPL(get_user_pages_fast); unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, - unsigned long flag, unsigned long pgoff) + unsigned long flag, unsigned long pgoff, int fd) { unsigned long ret; struct mm_struct *mm = current->mm; @@ -331,7 +331,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, if (down_write_killable(&mm->mmap_sem)) return -EINTR; ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, - &populate, &uf); + &populate, &uf, fd); up_write(&mm->mmap_sem); userfaultfd_unmap_complete(mm, &uf); if (populate) @@ -349,7 +349,8 @@ unsigned long vm_mmap(struct file *file, unsigned long addr, if (unlikely(offset_in_page(offset))) return -EINVAL; - return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT); + return vm_mmap_pgoff(file, addr, len, prot, flag, + offset >> PAGE_SHIFT, -1); } EXPORT_SYMBOL(vm_mmap); ^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v7 04/12] fs: MAP_DIRECT core [not found] ` <150732931273.22363.8436792888326501071.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2017-10-06 22:35 ` [PATCH v7 01/12] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags Dan Williams 2017-10-06 22:35 ` [PATCH v7 02/12] fs, mm: pass fd to ->mmap_validate() Dan Williams @ 2017-10-06 22:35 ` Dan Williams 2017-10-06 22:36 ` [PATCH v7 08/12] fs, mapdirect: introduce ->lease_direct() Dan Williams 2017-10-06 22:36 ` [PATCH v7 10/12] device-dax: wire up ->lease_direct() Dan Williams 4 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw) To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw Cc: linux-xfs-u79uwXL29TY76Z2rM5mHXA, Jan Kara, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jeff Layton, Ross Zwisler Introduce a set of helper apis for filesystems to establish FL_LAYOUT leases to protect against writes and block map updates while a MAP_DIRECT mapping is established. While the lease protects against the syscall write path and fallocate it does not protect against allocating write-faults, so this relies on i_mapdcount to disable block map updates from write faults. Like the pnfs case MAP_DIRECT does its own timeout of the lease since we need to have a process context for running map_direct_invalidate(). Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> Cc: "Darrick J. 
Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Cc: Jeff Layton <jlayton-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org> Cc: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/Makefile | 2 fs/mapdirect.c | 232 +++++++++++++++++++++++++++++++++++++++++++++ include/linux/mapdirect.h | 45 +++++++++ 3 files changed, 278 insertions(+), 1 deletion(-) create mode 100644 fs/mapdirect.c create mode 100644 include/linux/mapdirect.h diff --git a/fs/Makefile b/fs/Makefile index 7bbaca9c67b1..c0e791d235d8 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o -obj-$(CONFIG_FS_DAX) += dax.o +obj-$(CONFIG_FS_DAX) += dax.o mapdirect.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o diff --git a/fs/mapdirect.c b/fs/mapdirect.c new file mode 100644 index 000000000000..9ac7c1d946a2 --- /dev/null +++ b/fs/mapdirect.c @@ -0,0 +1,232 @@ +/* + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. 
+ */ +#include <linux/mapdirect.h> +#include <linux/workqueue.h> +#include <linux/signal.h> +#include <linux/mutex.h> +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/fs.h> +#include <linux/mm.h> + +#define MAPDIRECT_BREAK 0 +#define MAPDIRECT_VALID 1 + +struct map_direct_state { + atomic_t mds_ref; + atomic_t mds_vmaref; + unsigned long mds_state; + struct inode *mds_inode; + struct delayed_work mds_work; + struct fasync_struct *mds_fa; + struct vm_area_struct *mds_vma; +}; + +bool is_map_direct_valid(struct map_direct_state *mds) +{ + return test_bit(MAPDIRECT_VALID, &mds->mds_state); +} +EXPORT_SYMBOL_GPL(is_map_direct_valid); + +static void put_map_direct(struct map_direct_state *mds) +{ + if (!atomic_dec_and_test(&mds->mds_ref)) + return; + kfree(mds); +} + +int put_map_direct_vma(struct map_direct_state *mds) +{ + struct vm_area_struct *vma = mds->mds_vma; + struct file *file = vma->vm_file; + struct inode *inode = file_inode(file); + void *owner = mds; + + if (!atomic_dec_and_test(&mds->mds_vmaref)) + return 0; + + /* + * Flush in-flight+forced lm_break events that may be + * referencing this dying vma. 
+ */ + mds->mds_vma = NULL; + set_bit(MAPDIRECT_BREAK, &mds->mds_state); + vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner); + flush_delayed_work(&mds->mds_work); + iput(inode); + + put_map_direct(mds); + return 1; +} +EXPORT_SYMBOL_GPL(put_map_direct_vma); + +void get_map_direct_vma(struct map_direct_state *mds) +{ + atomic_inc(&mds->mds_vmaref); +} +EXPORT_SYMBOL_GPL(get_map_direct_vma); + +static void map_direct_invalidate(struct work_struct *work) +{ + struct map_direct_state *mds; + struct vm_area_struct *vma; + struct inode *inode; + void *owner; + + mds = container_of(work, typeof(*mds), mds_work.work); + + clear_bit(MAPDIRECT_VALID, &mds->mds_state); + + vma = ACCESS_ONCE(mds->mds_vma); + inode = mds->mds_inode; + if (vma) { + unsigned long len = vma->vm_end - vma->vm_start; + loff_t start = (loff_t) vma->vm_pgoff * PAGE_SIZE; + + unmap_mapping_range(inode->i_mapping, start, len, 1); + } + owner = mds; + vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner); + + put_map_direct(mds); +} + +static bool map_direct_lm_break(struct file_lock *fl) +{ + struct map_direct_state *mds = fl->fl_owner; + + /* + * Given that we need to take sleeping locks to invalidate the + * mapping we schedule that work with the original timeout set + * by the file-locks core. Then we tell the core to hold off on + * continuing with the lease break until the delayed work + * completes the invalidation and the lease unlock. + * + * Note that this assumes that i_mapdcount is protecting against + * block-map modifying write-faults since we are unable to use + * leases in that path due to locking constraints. 
+ */ + if (!test_and_set_bit(MAPDIRECT_BREAK, &mds->mds_state)) { + schedule_delayed_work(&mds->mds_work, lease_break_time * HZ); + kill_fasync(&fl->fl_fasync, SIGIO, POLL_MSG); + } + + /* Tell the core lease code to wait for delayed work completion */ + fl->fl_break_time = 0; + + return false; +} + +static int map_direct_lm_change(struct file_lock *fl, int arg, + struct list_head *dispose) +{ + struct map_direct_state *mds = fl->fl_owner; + + WARN_ON(!(arg & F_UNLCK)); + + i_mapdcount_dec(mds->mds_inode); + return lease_modify(fl, arg, dispose); +} + +static void map_direct_lm_setup(struct file_lock *fl, void **priv) +{ + struct file *file = fl->fl_file; + struct map_direct_state *mds = *priv; + struct fasync_struct *fa = mds->mds_fa; + + /* + * Comment copied from lease_setup(): + * fasync_insert_entry() returns the old entry if any. If there was no + * old entry, then it used "priv" and inserted it into the fasync list. + * Clear the pointer to indicate that it shouldn't be freed. + */ + if (!fasync_insert_entry(fa->fa_fd, file, &fl->fl_fasync, fa)) + *priv = NULL; + + __f_setown(file, task_pid(current), PIDTYPE_PID, 0); +} + +static const struct lock_manager_operations map_direct_lm_ops = { + .lm_break = map_direct_lm_break, + .lm_change = map_direct_lm_change, + .lm_setup = map_direct_lm_setup, +}; + +struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma) +{ + struct map_direct_state *mds = kzalloc(sizeof(*mds), GFP_KERNEL); + struct file *file = vma->vm_file; + struct inode *inode = file_inode(file); + struct fasync_struct *fa; + struct file_lock *fl; + void *owner = mds; + int rc = -ENOMEM; + + if (!mds) + return ERR_PTR(-ENOMEM); + + mds->mds_vma = vma; + atomic_set(&mds->mds_ref, 1); + atomic_set(&mds->mds_vmaref, 1); + set_bit(MAPDIRECT_VALID, &mds->mds_state); + mds->mds_inode = inode; + ihold(inode); + INIT_DELAYED_WORK(&mds->mds_work, map_direct_invalidate); + + fa = fasync_alloc(); + if (!fa) + goto err_fasync_alloc; + 
mds->mds_fa = fa; + fa->fa_fd = fd; + + fl = locks_alloc_lock(); + if (!fl) + goto err_lock_alloc; + + locks_init_lock(fl); + fl->fl_lmops = &map_direct_lm_ops; + fl->fl_flags = FL_LAYOUT; + fl->fl_type = F_RDLCK; + fl->fl_end = OFFSET_MAX; + fl->fl_owner = mds; + atomic_inc(&mds->mds_ref); + fl->fl_pid = current->tgid; + fl->fl_file = file; + + rc = vfs_setlease(file, fl->fl_type, &fl, &owner); + if (rc) + goto err_setlease; + if (fl) { + WARN_ON(1); + owner = mds; + vfs_setlease(file, F_UNLCK, NULL, &owner); + owner = NULL; + rc = -ENXIO; + goto err_setlease; + } + + i_mapdcount_inc(inode); + return mds; + +err_setlease: + locks_free_lock(fl); +err_lock_alloc: + /* if owner is NULL then the lease machinery is responsible for @fa */ + if (owner) + fasync_free(fa); +err_fasync_alloc: + iput(inode); + kfree(mds); + return ERR_PTR(rc); +} +EXPORT_SYMBOL_GPL(map_direct_register); diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h new file mode 100644 index 000000000000..724e27d8615e --- /dev/null +++ b/include/linux/mapdirect.h @@ -0,0 +1,45 @@ +/* + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. 
+ */ +#ifndef __MAPDIRECT_H__ +#define __MAPDIRECT_H__ +#include <linux/err.h> + +struct inode; +struct work_struct; +struct vm_area_struct; +struct map_direct_state; + +#if IS_ENABLED(CONFIG_FS_DAX) +struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma); +int put_map_direct_vma(struct map_direct_state *mds); +void get_map_direct_vma(struct map_direct_state *mds); +bool is_map_direct_valid(struct map_direct_state *mds); +#else +static inline struct map_direct_state *map_direct_register(int fd, + struct vm_area_struct *vma) +{ + return ERR_PTR(-EOPNOTSUPP); +} +static inline int put_map_direct_vma(struct map_direct_state *mds) +{ + return 0; +} +static inline void get_map_direct_vma(struct map_direct_state *mds) +{ +} +static inline bool is_map_direct_valid(struct map_direct_state *mds) +{ + return false; +} +#endif +#endif /* __MAPDIRECT_H__ */ -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v7 08/12] fs, mapdirect: introduce ->lease_direct() [not found] ` <150732931273.22363.8436792888326501071.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> ` (2 preceding siblings ...) 2017-10-06 22:35 ` [PATCH v7 04/12] fs: MAP_DIRECT core Dan Williams @ 2017-10-06 22:36 ` Dan Williams 2017-10-06 22:36 ` [PATCH v7 10/12] device-dax: wire up ->lease_direct() Dan Williams 4 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw) To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw Cc: linux-xfs-u79uwXL29TY76Z2rM5mHXA, Jan Kara, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jeff Layton, Ross Zwisler Provide a vma operation that registers a lease that is broken by break_layout(). This is motivated by a need to stop in-progress RDMA when the block-map of a DAX-file changes. I.e. since DAX gives direct-access to filesystem blocks we can not allow those blocks to move or change state while they are under active RDMA. So, if the filesystem determines it needs to move blocks it can revoke device access before proceeding. Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> Cc: "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Cc: Jeff Layton <jlayton-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org> Cc: "J. 
Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/mapdirect.c | 117 +++++++++++++++++++++++++++++++++++++++++++++ include/linux/mapdirect.h | 23 +++++++++ include/linux/mm.h | 6 ++ 3 files changed, 146 insertions(+) diff --git a/fs/mapdirect.c b/fs/mapdirect.c index 9ac7c1d946a2..338cbe055fc7 100644 --- a/fs/mapdirect.c +++ b/fs/mapdirect.c @@ -16,6 +16,7 @@ #include <linux/mutex.h> #include <linux/sched.h> #include <linux/slab.h> +#include <linux/file.h> #include <linux/fs.h> #include <linux/mm.h> @@ -32,12 +33,26 @@ struct map_direct_state { struct vm_area_struct *mds_vma; }; +struct lease_direct_state { + void *lds_owner; + struct file *lds_file; + unsigned long lds_state; + void (*lds_break_fn)(void *lds_owner); + struct work_struct lds_work; +}; + bool is_map_direct_valid(struct map_direct_state *mds) { return test_bit(MAPDIRECT_VALID, &mds->mds_state); } EXPORT_SYMBOL_GPL(is_map_direct_valid); +bool is_map_direct_broken(struct map_direct_state *mds) +{ + return test_bit(MAPDIRECT_BREAK, &mds->mds_state); +} +EXPORT_SYMBOL_GPL(is_map_direct_broken); + static void put_map_direct(struct map_direct_state *mds) { if (!atomic_dec_and_test(&mds->mds_ref)) @@ -162,6 +177,108 @@ static const struct lock_manager_operations map_direct_lm_ops = { .lm_setup = map_direct_lm_setup, }; +static void lease_direct_invalidate(struct work_struct *work) +{ + struct lease_direct_state *lds; + void *owner; + + lds = container_of(work, typeof(*lds), lds_work); + owner = lds; + lds->lds_break_fn(lds->lds_owner); + vfs_setlease(lds->lds_file, F_UNLCK, NULL, &owner); +} + +static bool lease_direct_lm_break(struct file_lock *fl) +{ + struct lease_direct_state *lds = fl->fl_owner; + + if (!test_and_set_bit(MAPDIRECT_BREAK, &lds->lds_state)) + schedule_work(&lds->lds_work); + return false; +} + +static int lease_direct_lm_change(struct file_lock *fl, int arg, + struct list_head 
*dispose) +{ + WARN_ON(!(arg & F_UNLCK)); + return lease_modify(fl, arg, dispose); +} + +static const struct lock_manager_operations lease_direct_lm_ops = { + .lm_break = lease_direct_lm_break, + .lm_change = lease_direct_lm_change, +}; + +struct lease_direct *map_direct_lease(struct vm_area_struct *vma, + void (*lds_break_fn)(void *), void *lds_owner) +{ + struct file *file = vma->vm_file; + struct lease_direct_state *lds; + struct lease_direct *ld; + struct file_lock *fl; + int rc = -ENOMEM; + void *owner; + + ld = kzalloc(sizeof(*ld) + sizeof(*lds), GFP_KERNEL); + if (!ld) + return ERR_PTR(-ENOMEM); + INIT_LIST_HEAD(&ld->list); + lds = (struct lease_direct_state *)(ld + 1); + owner = lds; + ld->lds = lds; + lds->lds_break_fn = lds_break_fn; + lds->lds_owner = lds_owner; + INIT_WORK(&lds->lds_work, lease_direct_invalidate); + lds->lds_file = get_file(file); + + fl = locks_alloc_lock(); + if (!fl) + goto err_lock_alloc; + + locks_init_lock(fl); + fl->fl_lmops = &lease_direct_lm_ops; + fl->fl_flags = FL_LAYOUT; + fl->fl_type = F_RDLCK; + fl->fl_end = OFFSET_MAX; + fl->fl_owner = lds; + fl->fl_pid = current->tgid; + fl->fl_file = file; + + rc = vfs_setlease(file, fl->fl_type, &fl, &owner); + if (rc) + goto err_setlease; + if (fl) { + WARN_ON(1); + owner = lds; + vfs_setlease(file, F_UNLCK, NULL, &owner); + owner = NULL; + rc = -ENXIO; + goto err_setlease; + } + + return ld; +err_setlease: + locks_free_lock(fl); +err_lock_alloc: + fput(lds->lds_file); + kfree(ld); + return ERR_PTR(rc); +} +EXPORT_SYMBOL_GPL(map_direct_lease); + +void map_direct_lease_destroy(struct lease_direct *ld) +{ + struct lease_direct_state *lds = ld->lds; + struct file *file = lds->lds_file; + void *owner = lds; + + vfs_setlease(file, F_UNLCK, NULL, &owner); + flush_work(&lds->lds_work); + fput(file); + WARN_ON(!list_empty(&ld->list)); + kfree(ld); +} +EXPORT_SYMBOL_GPL(map_direct_lease_destroy); + struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma) { struct map_direct_state *mds = 
kzalloc(sizeof(*mds), GFP_KERNEL); diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h index 724e27d8615e..dc4d4ba677d0 100644 --- a/include/linux/mapdirect.h +++ b/include/linux/mapdirect.h @@ -13,17 +13,28 @@ #ifndef __MAPDIRECT_H__ #define __MAPDIRECT_H__ #include <linux/err.h> +#include <linux/list.h> struct inode; struct work_struct; struct vm_area_struct; struct map_direct_state; +struct lease_direct_state; + +struct lease_direct { + struct list_head list; + struct lease_direct_state *lds; +}; #if IS_ENABLED(CONFIG_FS_DAX) struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma); int put_map_direct_vma(struct map_direct_state *mds); void get_map_direct_vma(struct map_direct_state *mds); bool is_map_direct_valid(struct map_direct_state *mds); +bool is_map_direct_broken(struct map_direct_state *mds); +struct lease_direct *map_direct_lease(struct vm_area_struct *vma, + void (*ld_break_fn)(void *), void *ld_owner); +void map_direct_lease_destroy(struct lease_direct *ld); #else static inline struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma) @@ -41,5 +52,17 @@ bool is_map_direct_valid(struct map_direct_state *mds) { return false; } +static inline bool is_map_direct_broken(struct map_direct_state *mds) +{ + return false; +} +static inline struct lease_direct *map_direct_lease(struct vm_area_struct *vma, + void (*ld_break_fn)(void *), void *ld_owner) +{ + return ERR_PTR(-EOPNOTSUPP); +} +static inline void map_direct_lease_destroy(struct lease_direct *ld) +{ +} #endif #endif /* __MAPDIRECT_H__ */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 0afa19feb755..d03953f91ce8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -420,6 +420,12 @@ struct vm_operations_struct { */ struct page *(*find_special_page)(struct vm_area_struct *vma, unsigned long addr); + /* + * Called by rdma memory registration to subscribe for "break" + * events that require any ongoing rdma accesses to quiesce. 
+ */ + struct lease_direct *(*lease_direct)(struct vm_area_struct *vma, + void (*break_fn)(void *), void *owner); }; struct mmu_gather; ^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v7 10/12] device-dax: wire up ->lease_direct() [not found] ` <150732931273.22363.8436792888326501071.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> ` (3 preceding siblings ...) 2017-10-06 22:36 ` [PATCH v7 08/12] fs, mapdirect: introduce ->lease_direct() Dan Williams @ 2017-10-06 22:36 ` Dan Williams 4 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw) To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw Cc: Jan Kara, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Ross Zwisler, Christoph Hellwig The only event that will break a lease_direct lease in the device-dax case is the device shutdown path where the physical pages might get assigned to another device. Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- drivers/dax/device.c | 4 ++++ fs/Kconfig | 4 ++++ fs/Makefile | 3 ++- include/linux/mapdirect.h | 2 +- 4 files changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index e9f3b3e4bbf4..fa75004185c4 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -10,6 +10,7 @@ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. 
*/ +#include <linux/mapdirect.h> #include <linux/pagemap.h> #include <linux/module.h> #include <linux/device.h> @@ -430,6 +431,7 @@ static int dev_dax_fault(struct vm_fault *vmf) static const struct vm_operations_struct dax_vm_ops = { .fault = dev_dax_fault, .huge_fault = dev_dax_huge_fault, + .lease_direct = map_direct_lease, }; static int dax_mmap(struct file *filp, struct vm_area_struct *vma) @@ -540,8 +542,10 @@ static void kill_dev_dax(struct dev_dax *dev_dax) { struct dax_device *dax_dev = dev_dax->dax_dev; struct inode *inode = dax_inode(dax_dev); + const bool wait = true; kill_dax(dax_dev); + break_layout(inode, wait); unmap_mapping_range(inode->i_mapping, 0, 0, 1); } diff --git a/fs/Kconfig b/fs/Kconfig index 7aee6d699fd6..2e3784ae1bc4 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -58,6 +58,10 @@ config FS_DAX_PMD depends on ZONE_DEVICE depends on TRANSPARENT_HUGEPAGE +config DAX_MAP_DIRECT + bool + default FS_DAX || DEV_DAX + endif # BLOCK # Posix ACL utility routines diff --git a/fs/Makefile b/fs/Makefile index c0e791d235d8..21b8fb104656 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,7 +29,8 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o -obj-$(CONFIG_FS_DAX) += dax.o mapdirect.o +obj-$(CONFIG_FS_DAX) += dax.o +obj-$(CONFIG_DAX_MAP_DIRECT) += mapdirect.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h index dc4d4ba677d0..bafa78a6085f 100644 --- a/include/linux/mapdirect.h +++ b/include/linux/mapdirect.h @@ -26,7 +26,7 @@ struct lease_direct { struct lease_direct_state *lds; }; -#if IS_ENABLED(CONFIG_FS_DAX) +#if IS_ENABLED(CONFIG_DAX_MAP_DIRECT) struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma); int put_map_direct_vma(struct map_direct_state *mds); void get_map_direct_vma(struct 
map_direct_state *mds); ^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v7 09/12] xfs: wire up ->lease_direct() 2017-10-06 22:35 [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush Dan Williams ` (4 preceding siblings ...) [not found] ` <150732931273.22363.8436792888326501071.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-06 22:36 ` Dan Williams [not found] ` <150732936625.22363.7638037715540836828.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2017-10-06 22:36 ` [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings Dan Williams 2017-10-06 22:36 ` [PATCH v7 12/12] tools/testing/nvdimm: enable rdma unit tests Dan Williams 7 siblings, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw) To: linux-nvdimm Cc: linux-xfs, Jan Kara, Darrick J. Wong, linux-rdma, linux-api, Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm, Jeff Moyer, linux-fsdevel, Jeff Layton, Ross Zwisler A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT mapping established. For xfs we establish a new lease and then check if the MAP_DIRECT mapping has been broken. We want to be sure that the process will receive notification that the MAP_DIRECT mapping is being torn down so it knows why other code paths are throwing failures. For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the MAP_DIRECT mapping is invalid or in the process of being invalidated. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: "J. 
Bruce Fields" <bfields@fieldses.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/xfs/xfs_file.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e35518600e28..823b65f17429 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1166,6 +1166,33 @@ xfs_filemap_direct_close( put_map_direct_vma(vma->vm_private_data); } +static struct lease_direct * +xfs_filemap_direct_lease( + struct vm_area_struct *vma, + void (*break_fn)(void *), + void *owner) +{ + struct lease_direct *ld; + + ld = map_direct_lease(vma, break_fn, owner); + + if (IS_ERR(ld)) + return ld; + + /* + * We now have an established lease while the base MAP_DIRECT + * lease was not broken. So, we know that the "lease holder" will + * receive a SIGIO notification when the lease is broken and + * take any necessary cleanup actions. + */ + if (!is_map_direct_broken(vma->vm_private_data)) + return ld; + + map_direct_lease_destroy(ld); + + return ERR_PTR(-ENXIO); +} + static const struct vm_operations_struct xfs_file_vm_direct_ops = { .fault = xfs_filemap_fault, .huge_fault = xfs_filemap_huge_fault, @@ -1175,6 +1202,7 @@ static const struct vm_operations_struct xfs_file_vm_direct_ops = { .open = xfs_filemap_direct_open, .close = xfs_filemap_direct_close, + .lease_direct = xfs_filemap_direct_lease, }; static const struct vm_operations_struct xfs_file_vm_ops = { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 49+ messages in thread
[parent not found: <150732936625.22363.7638037715540836828.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH v7 09/12] xfs: wire up ->lease_direct() [not found] ` <150732936625.22363.7638037715540836828.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-09 3:45 ` Dave Chinner 2017-10-09 17:10 ` Dan Williams 0 siblings, 1 reply; 49+ messages in thread From: Dave Chinner @ 2017-10-09 3:45 UTC (permalink / raw) To: Dan Williams Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jeff Layton, Christoph Hellwig On Fri, Oct 06, 2017 at 03:36:06PM -0700, Dan Williams wrote: > A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT > mapping established. For xfs we establish a new lease and then check if > the MAP_DIRECT mapping has been broken. We want to be sure that the > process will receive notification that the MAP_DIRECT mapping is being > torn down so it knows why other code paths are throwing failures. > > For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the > MAP_DIRECT mapping is invalid or in the process of being invalidated. > > Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> > Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> > Cc: "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > Cc: Jeff Layton <jlayton-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org> > Cc: "J. 
Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> > Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > --- > fs/xfs/xfs_file.c | 28 ++++++++++++++++++++++++++++ > 1 file changed, 28 insertions(+) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index e35518600e28..823b65f17429 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1166,6 +1166,33 @@ xfs_filemap_direct_close( > put_map_direct_vma(vma->vm_private_data); > } > > +static struct lease_direct * > +xfs_filemap_direct_lease( > + struct vm_area_struct *vma, > + void (*break_fn)(void *), > + void *owner) > +{ > + struct lease_direct *ld; > + > + ld = map_direct_lease(vma, break_fn, owner); > + > + if (IS_ERR(ld)) > + return ld; > + > + /* > + * We now have an established lease while the base MAP_DIRECT > + * lease was not broken. So, we know that the "lease holder" will > + * receive a SIGIO notification when the lease is broken and > + * take any necessary cleanup actions. > + */ > + if (!is_map_direct_broken(vma->vm_private_data)) > + return ld; > + > + map_direct_lease_destroy(ld); > + > + return ERR_PTR(-ENXIO); What's any of this got to do with XFS? Shouldn't it be in generic code, and called generic_filemap_direct_lease()? Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v7 09/12] xfs: wire up ->lease_direct() 2017-10-09 3:45 ` Dave Chinner @ 2017-10-09 17:10 ` Dan Williams 0 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-09 17:10 UTC (permalink / raw) To: Dave Chinner Cc: linux-nvdimm@lists.01.org, linux-xfs, Jan Kara, Darrick J. Wong, linux-rdma, Linux API, Christoph Hellwig, J. Bruce Fields, Linux MM, Jeff Moyer, linux-fsdevel, Jeff Layton, Ross Zwisler On Sun, Oct 8, 2017 at 8:45 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Oct 06, 2017 at 03:36:06PM -0700, Dan Williams wrote: >> A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT >> mapping established. For xfs we establish a new lease and then check if >> the MAP_DIRECT mapping has been broken. We want to be sure that the >> process will receive notification that the MAP_DIRECT mapping is being >> torn down so it knows why other code paths are throwing failures. >> >> For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the >> MAP_DIRECT mapping is invalid or in the process of being invalidated. >> >> Cc: Jan Kara <jack@suse.cz> >> Cc: Jeff Moyer <jmoyer@redhat.com> >> Cc: Christoph Hellwig <hch@lst.de> >> Cc: Dave Chinner <david@fromorbit.com> >> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> >> Cc: Jeff Layton <jlayton@poochiereds.net> >> Cc: "J. 
Bruce Fields" <bfields@fieldses.org> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com> >> --- >> fs/xfs/xfs_file.c | 28 ++++++++++++++++++++++++++++ >> 1 file changed, 28 insertions(+) >> >> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c >> index e35518600e28..823b65f17429 100644 >> --- a/fs/xfs/xfs_file.c >> +++ b/fs/xfs/xfs_file.c >> @@ -1166,6 +1166,33 @@ xfs_filemap_direct_close( >> put_map_direct_vma(vma->vm_private_data); >> } >> >> +static struct lease_direct * >> +xfs_filemap_direct_lease( >> + struct vm_area_struct *vma, >> + void (*break_fn)(void *), >> + void *owner) >> +{ >> + struct lease_direct *ld; >> + >> + ld = map_direct_lease(vma, break_fn, owner); >> + >> + if (IS_ERR(ld)) >> + return ld; >> + >> + /* >> + * We now have an established lease while the base MAP_DIRECT >> + * lease was not broken. So, we know that the "lease holder" will >> + * receive a SIGIO notification when the lease is broken and >> + * take any necessary cleanup actions. >> + */ >> + if (!is_map_direct_broken(vma->vm_private_data)) >> + return ld; >> + >> + map_direct_lease_destroy(ld); >> + >> + return ERR_PTR(-ENXIO); > > What's any of this got to do with XFS? Shouldn't it be in generic > code, and called generic_filemap_direct_lease()? True, I can move this to generic code. The filesystem is in charge of where it wants to store the 'struct map_direct_state' context, but for generic_filemap_direct_lease() it can just assume that it is stored in ->vm_private_data. I'll add comments to this effect on the new routine. ^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings 2017-10-06 22:35 [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush Dan Williams ` (5 preceding siblings ...) 2017-10-06 22:36 ` [PATCH v7 09/12] xfs: " Dan Williams @ 2017-10-06 22:36 ` Dan Williams 2017-10-08 4:02 ` [PATCH v8 1/2] iommu: up-level sg_num_pages() from amd-iommu Dan Williams 2017-10-08 4:04 ` [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings Dan Williams 2017-10-06 22:36 ` [PATCH v7 12/12] tools/testing/nvdimm: enable rdma unit tests Dan Williams 7 siblings, 2 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw) To: linux-nvdimm Cc: Sean Hefty, linux-xfs, Jan Kara, Darrick J. Wong, linux-rdma, linux-api, Dave Chinner, Jeff Moyer, Christoph Hellwig, J. Bruce Fields, linux-mm, Doug Ledford, Ross Zwisler, linux-fsdevel, Jeff Layton, Hal Rosenstock Currently the ibverbs core in the kernel is completely unaware of the dangers of filesystem-DAX mappings. Specifically, the filesystem is free to move file blocks at will. In the case of DAX, it means that RDMA to a given file offset can dynamically switch to another file offset, another file, or free space with no notification to RDMA device to cease operations. Historically, this lack of communication between the ibverbs core and filesystem was not a problem because RDMA always targeted dynamically allocated page cache, so at least the RDMA device would have valid memory to target even if the file was being modified. With DAX we need to add coordination since RDMA is bypassing page-cache and going direct to on-media pages of the file. RDMA to DAX can cause damage if filesystem blocks move / change state. Use the new ->lease_direct() operation to get a notification when the filesystem is invalidating the block map of the file and needs RDMA operations to stop. 
Given that the kernel can not be in a position where it needs to wait indefinitely for userspace to stop a device we need a mechanism where the kernel can force-revoke access. Towards that end, use the new dma_has_iommu() helper to determine if ib_dma_unmap_sg() is sufficient for revoking access. Once we have that assurance and a ->lease_direct() lease we can safely allow RDMA to DAX. Cc: Sean Hefty <sean.hefty@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/infiniband/core/umem.c | 90 ++++++++++++++++++++++++++++++++++------ include/rdma/ib_umem.h | 8 ++++ 2 files changed, 85 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 21e60b1e2ff4..dc3ae1bee669 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -36,6 +36,7 @@ #include <linux/dma-mapping.h> #include <linux/sched/signal.h> #include <linux/sched/mm.h> +#include <linux/mapdirect.h> #include <linux/export.h> #include <linux/hugetlb.h> #include <linux/slab.h> @@ -46,11 +47,12 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) { + struct lease_direct *ld, *_ld; struct scatterlist *sg; struct page *page; int i; - if (umem->nmap > 0) + if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state)) ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->npages, DMA_BIDIRECTIONAL); @@ -64,8 +66,22 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d } sg_free_table(&umem->sg_head); - return; + list_for_each_entry_safe(ld, _ld, 
&umem->leases, list) { + list_del_init(&ld->list); + map_direct_lease_destroy(ld); + } +} + +static void ib_umem_lease_break(void *__umem) +{ + struct ib_umem *umem = __umem; + struct ib_device *dev = umem->context->device; + + if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state)) + ib_dma_unmap_sg(dev, umem->sg_head.sgl, + umem->npages, + DMA_BIDIRECTIONAL); } /** @@ -96,7 +112,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, struct scatterlist *sg, *sg_list_start; int need_release = 0; unsigned int gup_flags = FOLL_WRITE; + struct vm_area_struct *vma_prev = NULL; + struct device *dma_dev; + dma_dev = context->device->dma_device; if (dmasync) dma_attrs |= DMA_ATTR_WRITE_BARRIER; @@ -120,6 +139,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->address = addr; umem->page_shift = PAGE_SHIFT; umem->pid = get_task_pid(current, PIDTYPE_PID); + INIT_LIST_HEAD(&umem->leases); + set_bit(IB_UMEM_MAPPED, &umem->state); /* * We ask for writable memory if any of the following * access flags are set. "Local write" and "remote write" @@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->hugetlb = 1; page_list = (struct page **) __get_free_page(GFP_KERNEL); - if (!page_list) { - put_pid(umem->pid); - kfree(umem); - return ERR_PTR(-ENOMEM); - } + if (!page_list) + goto err_pagelist; /* - * if we can't alloc the vma_list, it's not so bad; - * just assume the memory is not hugetlb memory + * If DAX is enabled we need the vma to setup a ->lease_direct() + * lease to protect against file modifications, otherwise we can + * tolerate a failure to allocate the vma_list and just assume + * that all vmas are not hugetlb-vmas. 
*/ vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL); - if (!vma_list) + if (!vma_list) { + if (IS_ENABLED(CONFIG_DAX_MAP_DIRECT)) + goto err_vmalist; umem->hugetlb = 0; + } npages = ib_umem_num_pages(umem); @@ -199,15 +222,50 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (ret < 0) goto out; - umem->npages += ret; cur_base += ret * PAGE_SIZE; npages -= ret; for_each_sg(sg_list_start, sg, ret, i) { - if (vma_list && !is_vm_hugetlb_page(vma_list[i])) - umem->hugetlb = 0; + const struct vm_operations_struct *vm_ops; + struct vm_area_struct *vma; + struct lease_direct *ld; sg_set_page(sg, page_list[i], PAGE_SIZE, 0); + umem->npages++; + + if (!vma_list) + continue; + vma = vma_list[i]; + + if (vma == vma_prev) + continue; + vma_prev = vma; + + if (!is_vm_hugetlb_page(vma)) + umem->hugetlb = 0; + + if (!vma_is_dax(vma)) + continue; + + vm_ops = vma->vm_ops; + if (!vm_ops->lease_direct) { + dev_info(dma_dev, "DAX-RDMA needs lease_direct\n"); + ret = -EOPNOTSUPP; + goto out; + } + + if (!dma_has_iommu(dma_dev)) { + dev_info(dma_dev, "DAX-RDMA needs iommu\n"); + ret = -EOPNOTSUPP; + goto out; + } + ld = vm_ops->lease_direct(vma, ib_umem_lease_break, + umem); + if (IS_ERR(ld)) { + ret = PTR_ERR(ld); + goto out; + } + list_add(&ld->list, &umem->leases); } /* preparing for next loop */ @@ -242,6 +300,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, free_page((unsigned long) page_list); return ret < 0 ? 
ERR_PTR(ret) : umem; +err_vmalist: + free_page((unsigned long) page_list); +err_pagelist: + put_pid(umem->pid); + kfree(umem); + return ERR_PTR(-ENOMEM); } EXPORT_SYMBOL(ib_umem_get); diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 23159dd5be18..796ffe5b8dc3 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -40,6 +40,7 @@ struct ib_ucontext; struct ib_umem_odp; +#define IB_UMEM_MAPPED 0 struct ib_umem { struct ib_ucontext *context; size_t length; @@ -55,6 +56,13 @@ struct ib_umem { struct sg_table sg_head; int nmap; int npages; + /* + * Note: no lock protects this list since we assume memory + * registration never races unregistration for a given ib_umem + * instance. + */ + struct list_head leases; + unsigned long state; }; /* Returns the offset of the umem start relative to the first page. */ ^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v8 1/2] iommu: up-level sg_num_pages() from amd-iommu 2017-10-06 22:36 ` [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings Dan Williams @ 2017-10-08 4:02 ` Dan Williams 2017-10-08 4:04 ` [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings Dan Williams 1 sibling, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-08 4:02 UTC (permalink / raw) To: linux-nvdimm Cc: linux-rdma, linux-api, Joerg Roedel, linux-xfs, linux-mm, iommu, linux-fsdevel iommu_sg_num_pages() is a helper that walks a scatterlist and counts pages taking segment boundaries and iommu_num_pages() into account. Up-level it for determining the IOVA range that dma_map_ops established at dma_map_sg() time. The intent is to iommu_unmap() the IOVA range in advance of freeing the IOVA range. Cc: Joerg Roedel <joro@8bytes.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- New patch in v8. drivers/iommu/amd_iommu.c | 30 ++---------------------------- drivers/iommu/iommu.c | 27 +++++++++++++++++++++++++++ include/linux/iommu.h | 2 ++ 3 files changed, 31 insertions(+), 28 deletions(-) diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index c8e1a45af182..4795b0823469 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -2459,32 +2459,6 @@ static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size, __unmap_single(dma_dom, dma_addr, size, dir); } -static int sg_num_pages(struct device *dev, - struct scatterlist *sglist, - int nelems) -{ - unsigned long mask, boundary_size; - struct scatterlist *s; - int i, npages = 0; - - mask = dma_get_seg_boundary(dev); - boundary_size = mask + 1 ? 
ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT : - 1UL << (BITS_PER_LONG - PAGE_SHIFT); - - for_each_sg(sglist, s, nelems, i) { - int p, n; - - s->dma_address = npages << PAGE_SHIFT; - p = npages % boundary_size; - n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE); - if (p + n > boundary_size) - npages += boundary_size - p; - npages += n; - } - - return npages; -} - /* * The exported map_sg function for dma_ops (handles scatter-gather * lists). @@ -2507,7 +2481,7 @@ static int map_sg(struct device *dev, struct scatterlist *sglist, dma_dom = to_dma_ops_domain(domain); dma_mask = *dev->dma_mask; - npages = sg_num_pages(dev, sglist, nelems); + npages = iommu_sg_num_pages(dev, sglist, nelems); address = dma_ops_alloc_iova(dev, dma_dom, npages, dma_mask); if (address == AMD_IOMMU_MAPPING_ERROR) @@ -2585,7 +2559,7 @@ static void unmap_sg(struct device *dev, struct scatterlist *sglist, startaddr = sg_dma_address(sglist) & PAGE_MASK; dma_dom = to_dma_ops_domain(domain); - npages = sg_num_pages(dev, sglist, nelems); + npages = iommu_sg_num_pages(dev, sglist, nelems); __unmap_single(dma_dom, startaddr, npages << PAGE_SHIFT, dir); } diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 3de5c0bcb5cc..cfe6eeea3578 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -33,6 +33,7 @@ #include <linux/bitops.h> #include <linux/property.h> #include <trace/events/iommu.h> +#include <linux/iommu-helper.h> static struct kset *iommu_group_kset; static DEFINE_IDA(iommu_group_ida); @@ -1631,6 +1632,32 @@ size_t iommu_unmap_fast(struct iommu_domain *domain, } EXPORT_SYMBOL_GPL(iommu_unmap_fast); +int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist, + int nelems) +{ + unsigned long mask, boundary_size; + struct scatterlist *s; + int i, npages = 0; + + mask = dma_get_seg_boundary(dev); + boundary_size = mask + 1 ? 
ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT + : 1UL << (BITS_PER_LONG - PAGE_SHIFT); + + for_each_sg(sglist, s, nelems, i) { + int p, n; + + s->dma_address = npages << PAGE_SHIFT; + p = npages % boundary_size; + n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE); + if (p + n > boundary_size) + npages += boundary_size - p; + npages += n; + } + + return npages; +} +EXPORT_SYMBOL_GPL(iommu_sg_num_pages); + size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova, struct scatterlist *sg, unsigned int nents, int prot) { diff --git a/include/linux/iommu.h b/include/linux/iommu.h index a7f2ac689d29..5b2d20e1475a 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -303,6 +303,8 @@ extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size); extern size_t iommu_unmap_fast(struct iommu_domain *domain, unsigned long iova, size_t size); +extern int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist, + int nelems); extern size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova, struct scatterlist *sg,unsigned int nents, int prot); ^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings 2017-10-06 22:36 ` [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings Dan Williams 2017-10-08 4:02 ` [PATCH v8 1/2] iommu: up-level sg_num_pages() from amd-iommu Dan Williams @ 2017-10-08 4:04 ` Dan Williams [not found] ` <150743537023.13602.3520782942682280917.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 49+ messages in thread From: Dan Williams @ 2017-10-08 4:04 UTC (permalink / raw) To: linux-nvdimm Cc: Sean Hefty, linux-xfs, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma, linux-api, Joerg Roedel, Dave Chinner, Jeff Moyer, iommu, Christoph Hellwig, J. Bruce Fields, linux-mm, Doug Ledford, Ross Zwisler, linux-fsdevel, Jeff Layton, David Woodhouse, Hal Rosenstock Currently the ibverbs core in the kernel is completely unaware of the dangers of filesystem-DAX mappings. Specifically, the filesystem is free to move file blocks at will. In the case of DAX, it means that RDMA to a given file offset can dynamically switch to another file offset, another file, or free space with no notification to the RDMA device to cease operations. Historically, this lack of communication between the ibverbs core and the filesystem was not a problem because RDMA always targeted dynamically allocated page cache, so at least the RDMA device would have valid memory to target even if the file was being modified. With DAX we need to add coordination since RDMA is bypassing the page cache and going direct to on-media pages of the file. RDMA to DAX can cause damage if filesystem blocks move / change state. Use the new ->lease_direct() operation to get a notification when the filesystem is invalidating the block map of the file and needs RDMA operations to stop. Given that the kernel cannot be in a position where it needs to wait indefinitely for userspace to stop a device, we need a mechanism where the kernel can force-revoke access. 
Towards that end, use the dma_get_iommu_domain() to both check if the device has domain mappings that can be invalidated and retrieve the iommu_domain for use with iommu_unmap. Once we have that assurance that we can block in-flight I/O when the file's block map changes then we can safely allow RDMA to DAX. Cc: Sean Hefty <sean.hefty@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Ashok Raj <ashok.raj@intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- Changes since v7: * Switch from dma_has_iommu() to dma_get_iommu_domain(). * Switch from dma_unmap_sg() at lease break time, to iommu_unmap() so that the IOVA remains allocated while the device might still be sending transactions. 
drivers/infiniband/core/umem.c | 90 +++++++++++++++++++++++++++++++++++----- include/rdma/ib_umem.h | 8 ++++ 2 files changed, 86 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 21e60b1e2ff4..5e4598982359 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -36,6 +36,7 @@ #include <linux/dma-mapping.h> #include <linux/sched/signal.h> #include <linux/sched/mm.h> +#include <linux/mapdirect.h> #include <linux/export.h> #include <linux/hugetlb.h> #include <linux/slab.h> @@ -46,10 +47,16 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) { + struct lease_direct *ld, *_ld; struct scatterlist *sg; struct page *page; int i; + list_for_each_entry_safe(ld, _ld, &umem->leases, list) { + list_del_init(&ld->list); + map_direct_lease_destroy(ld); + } + if (umem->nmap > 0) ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->npages, @@ -64,10 +71,20 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d } sg_free_table(&umem->sg_head); - return; } +static void ib_umem_lease_break(void *__umem) +{ + struct ib_umem *umem = __umem; + struct ib_device *idev = umem->context->device; + struct device *dev = idev->dma_device; + struct scatterlist *sgl = umem->sg_head.sgl; + + iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK, + iommu_sg_num_pages(dev, sgl, umem->npages)); +} + /** * ib_umem_get - Pin and DMA map userspace memory. 
* @@ -96,7 +113,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, struct scatterlist *sg, *sg_list_start; int need_release = 0; unsigned int gup_flags = FOLL_WRITE; + struct vm_area_struct *vma_prev = NULL; + struct device *dma_dev; + dma_dev = context->device->dma_device; if (dmasync) dma_attrs |= DMA_ATTR_WRITE_BARRIER; @@ -120,6 +140,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->address = addr; umem->page_shift = PAGE_SHIFT; umem->pid = get_task_pid(current, PIDTYPE_PID); + INIT_LIST_HEAD(&umem->leases); /* * We ask for writable memory if any of the following * access flags are set. "Local write" and "remote write" @@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->hugetlb = 1; page_list = (struct page **) __get_free_page(GFP_KERNEL); - if (!page_list) { - put_pid(umem->pid); - kfree(umem); - return ERR_PTR(-ENOMEM); - } + if (!page_list) + goto err_pagelist; /* - * if we can't alloc the vma_list, it's not so bad; - * just assume the memory is not hugetlb memory + * If DAX is enabled we need the vma to setup a ->lease_direct() + * lease to protect against file modifications, otherwise we can + * tolerate a failure to allocate the vma_list and just assume + * that all vmas are not hugetlb-vmas. 
*/ vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL); - if (!vma_list) + if (!vma_list) { + if (IS_ENABLED(CONFIG_DAX_MAP_DIRECT)) + goto err_vmalist; umem->hugetlb = 0; + } npages = ib_umem_num_pages(umem); @@ -199,15 +222,52 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (ret < 0) goto out; - umem->npages += ret; cur_base += ret * PAGE_SIZE; npages -= ret; for_each_sg(sg_list_start, sg, ret, i) { - if (vma_list && !is_vm_hugetlb_page(vma_list[i])) - umem->hugetlb = 0; + const struct vm_operations_struct *vm_ops; + struct vm_area_struct *vma; + struct lease_direct *ld; sg_set_page(sg, page_list[i], PAGE_SIZE, 0); + umem->npages++; + + if (!vma_list) + continue; + vma = vma_list[i]; + + if (vma == vma_prev) + continue; + vma_prev = vma; + + if (!is_vm_hugetlb_page(vma)) + umem->hugetlb = 0; + + if (!vma_is_dax(vma)) + continue; + + vm_ops = vma->vm_ops; + if (!vm_ops->lease_direct) { + dev_info(dma_dev, "DAX-RDMA requires a MAP_DIRECT mapping\n"); + ret = -EOPNOTSUPP; + goto out; + } + + if (!umem->iommu) + umem->iommu = dma_get_iommu_domain(dma_dev); + if (!umem->iommu) { + dev_info(dma_dev, "DAX-RDMA requires an iommu protected device\n"); + ret = -EOPNOTSUPP; + goto out; + } + ld = vm_ops->lease_direct(vma, ib_umem_lease_break, + umem); + if (IS_ERR(ld)) { + ret = PTR_ERR(ld); + goto out; + } + list_add(&ld->list, &umem->leases); } /* preparing for next loop */ @@ -242,6 +302,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, free_page((unsigned long) page_list); return ret < 0 ? 
ERR_PTR(ret) : umem; +err_vmalist: + free_page((unsigned long) page_list); +err_pagelist: + put_pid(umem->pid); + kfree(umem); + return ERR_PTR(-ENOMEM); } EXPORT_SYMBOL(ib_umem_get); diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 23159dd5be18..5048be012f96 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -34,6 +34,7 @@ #define IB_UMEM_H #include <linux/list.h> +#include <linux/iommu.h> #include <linux/scatterlist.h> #include <linux/workqueue.h> @@ -55,6 +56,13 @@ struct ib_umem { struct sg_table sg_head; int nmap; int npages; + /* + * Note: no lock protects this list since we assume memory + * registration never races unregistration for a given ib_umem + * instance. + */ + struct list_head leases; + struct iommu_domain *iommu; }; /* Returns the offset of the umem start relative to the first page. */ ^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings [not found] ` <150743537023.13602.3520782942682280917.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-08 6:45 ` kbuild test robot [not found] ` <201710081447.sQSonloO%fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: kbuild test robot @ 2017-10-08 6:45 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, Dave Chinner, J. Bruce Fields, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Layton, Sean Hefty, Ashok Raj, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Joerg Roedel, Christoph Hellwig, Doug Ledford, Hal Rosenstock, linux-api-u79uwXL29TY76Z2rM5mHXA, Darrick J. Wong, linux-xfs-u79uwXL29TY76Z2rM5mHXA, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, kbuild-all-JC7UmRfGjtg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, David Woodhouse Hi Dan, [auto build test ERROR on rdma/master] [also build test ERROR on v4.14-rc3 next-20170929] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Dan-Williams/iommu-up-level-sg_num_pages-from-amd-iommu/20171008-133505 base: https://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git master config: i386-randconfig-n0-201741 (attached as .config) compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): >> drivers/infiniband/core/umem.c:39:29: fatal error: linux/mapdirect.h: No such file or directory #include <linux/mapdirect.h> ^ compilation terminated. 
vim +39 drivers/infiniband/core/umem.c > 39 #include <linux/mapdirect.h> 40 #include <linux/export.h> 41 #include <linux/hugetlb.h> 42 #include <linux/slab.h> 43 #include <rdma/ib_umem_odp.h> 44 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings [not found] ` <201710081447.sQSonloO%fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-10-08 15:49 ` Dan Williams 0 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-08 15:49 UTC (permalink / raw) To: kbuild test robot Cc: kbuild-all-JC7UmRfGjtg, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, Sean Hefty, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux API, Joerg Roedel, Dave Chinner, Jeff Moyer, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Christoph Hellwig, J. Bruce Fields, Linux MM, Doug Ledford, Ross Zwisler, linux-fsdevel, Jeff Layton, David Woodhouse, Hal On Sat, Oct 7, 2017 at 11:45 PM, kbuild test robot <lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote: > Hi Dan, > > [auto build test ERROR on rdma/master] > [also build test ERROR on v4.14-rc3 next-20170929] > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] This was a fixed up resend of patch [v7 11/12]. It's not clear how to teach the kbuild robot to be aware of patch-replies to individual patches in the series. I.e. reworked patches without resending the complete series? > url: https://github.com/0day-ci/linux/commits/Dan-Williams/iommu-up-level-sg_num_pages-from-amd-iommu/20171008-133505 > base: https://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git master > config: i386-randconfig-n0-201741 (attached as .config) > compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4 > reproduce: > # save the attached .config to linux build tree > make ARCH=i386 > > All errors (new ones prefixed by >>): > >>> drivers/infiniband/core/umem.c:39:29: fatal error: linux/mapdirect.h: No such file or directory > #include <linux/mapdirect.h> mapdirect.h indeed does not exist when missing the earlier patches in the series. 
It would be slick if the 0day-robot read the "in-reply-to" header and auto-replaced a patch in a series, but that would be a feature approaching magic. > compilation terminated. > > vim +39 drivers/infiniband/core/umem.c > > > 39 #include <linux/mapdirect.h> > 40 #include <linux/export.h> > 41 #include <linux/hugetlb.h> > 42 #include <linux/slab.h> > 43 #include <rdma/ib_umem_odp.h> > 44 > > --- > 0-DAY kernel test infrastructure Open Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation ^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v7 12/12] tools/testing/nvdimm: enable rdma unit tests 2017-10-06 22:35 [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush Dan Williams ` (6 preceding siblings ...) 2017-10-06 22:36 ` [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings Dan Williams @ 2017-10-06 22:36 ` Dan Williams 7 siblings, 0 replies; 49+ messages in thread From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw) To: linux-nvdimm; +Cc: linux-xfs, linux-mm, linux-fsdevel, linux-api, linux-rdma Provide a mock dma_has_iommu() for the ibverbs core. Enable ib_umem_get() to satisfy its DAX safety checks for a controlled test. Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- tools/testing/nvdimm/Kbuild | 31 +++++++++++++++++++++++++++++++ tools/testing/nvdimm/config_check.c | 2 ++ tools/testing/nvdimm/test/iomap.c | 6 ++++++ 3 files changed, 39 insertions(+) diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild index d870520da68b..e4ee7f482ac0 100644 --- a/tools/testing/nvdimm/Kbuild +++ b/tools/testing/nvdimm/Kbuild @@ -15,11 +15,13 @@ ldflags-y += --wrap=insert_resource ldflags-y += --wrap=remove_resource ldflags-y += --wrap=acpi_evaluate_object ldflags-y += --wrap=acpi_evaluate_dsm +ldflags-y += --wrap=dma_has_iommu DRIVERS := ../../../drivers NVDIMM_SRC := $(DRIVERS)/nvdimm ACPI_SRC := $(DRIVERS)/acpi/nfit DAX_SRC := $(DRIVERS)/dax +IBCORE := $(DRIVERS)/infiniband/core ccflags-y := -I$(src)/$(NVDIMM_SRC)/ obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o @@ -33,6 +35,7 @@ obj-$(CONFIG_DAX) += dax.o endif obj-$(CONFIG_DEV_DAX) += device_dax.o obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o +obj-$(CONFIG_INFINIBAND) += ib_core.o nfit-y := $(ACPI_SRC)/core.o nfit-$(CONFIG_X86_MCE) += $(ACPI_SRC)/mce.o @@ -75,4 +78,32 @@ libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o libnvdimm-$(CONFIG_NVDIMM_DAX) += $(NVDIMM_SRC)/dax_devs.o libnvdimm-y += config_check.o +ib_core-y := $(IBCORE)/packer.o +ib_core-y += $(IBCORE)/ud_header.o +ib_core-y 
+= $(IBCORE)/verbs.o +ib_core-y += $(IBCORE)/cq.o +ib_core-y += $(IBCORE)/rw.o +ib_core-y += $(IBCORE)/sysfs.o +ib_core-y += $(IBCORE)/device.o +ib_core-y += $(IBCORE)/fmr_pool.o +ib_core-y += $(IBCORE)/cache.o +ib_core-y += $(IBCORE)/netlink.o +ib_core-y += $(IBCORE)/roce_gid_mgmt.o +ib_core-y += $(IBCORE)/mr_pool.o +ib_core-y += $(IBCORE)/addr.o +ib_core-y += $(IBCORE)/sa_query.o +ib_core-y += $(IBCORE)/multicast.o +ib_core-y += $(IBCORE)/mad.o +ib_core-y += $(IBCORE)/smi.o +ib_core-y += $(IBCORE)/agent.o +ib_core-y += $(IBCORE)/mad_rmpp.o +ib_core-y += $(IBCORE)/security.o +ib_core-y += $(IBCORE)/nldev.o + +ib_core-$(CONFIG_INFINIBAND_USER_MEM) += $(IBCORE)/umem.o +ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_odp.o +ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_rbtree.o +ib_core-$(CONFIG_CGROUP_RDMA) += $(IBCORE)/cgroup.o +ib_core-y += config_check.o + obj-m += test/ diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c index 7dc5a0af9b54..33e7c805bfd6 100644 --- a/tools/testing/nvdimm/config_check.c +++ b/tools/testing/nvdimm/config_check.c @@ -14,4 +14,6 @@ void check(void) BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT)); BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX)); BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM)); + BUILD_BUG_ON(!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM)); + BUILD_BUG_ON(!IS_MODULE(CONFIG_INFINIBAND)); } diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index e1f75a1914a1..1c240328ee5b 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -388,4 +388,10 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g } EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm); +bool __wrap_dma_has_iommu(struct device *dev) +{ + return true; +} +EXPORT_SYMBOL(__wrap_dma_has_iommu); + MODULE_LICENSE("GPL v2"); ^ permalink raw reply related [flat|nested] 49+ messages in thread