* [PATCH 00/15] dax: prep work for fixing dax-dma vs truncate collisions
@ 2017-10-31 23:21 Dan Williams
2017-10-31 23:22 ` [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas Dan Williams
2017-10-31 23:22 ` [PATCH 11/15] [media] v4l2: disable filesystem-dax mapping support Dan Williams
0 siblings, 2 replies; 5+ messages in thread
From: Dan Williams @ 2017-10-31 23:21 UTC (permalink / raw)
To: linux-nvdimm
Cc: Michal Hocko, Jan Kara, Peter Zijlstra, Benjamin Herrenschmidt,
Heiko Carstens, linux-mm, Paul Mackerras, Sean Hefty, hch,
Matthew Wilcox, linux-rdma, Michael Ellerman, Jeff Moyer,
Jason Gunthorpe, Doug Ledford, Ingo Molnar, Ross Zwisler,
Hal Rosenstock, linux-media, linux-fsdevel,
Jérôme Glisse, Mauro Carvalho Chehab, Gerald Schaefer,
Jens Axboe, linux-kernel, stable, linux-xfs, Martin Schwidefsky,
akpm, Kirill A. Shutemov
This is hopefully the uncontroversial lead-in set of changes that lay
the groundwork for solving the dax-dma vs truncate problem. The overview
of the changes is:
1/ Disable DAX when we do not have struct page entries backing dax
mappings, or otherwise allow limited DAX support for axonram and
dcssblk. Is anyone actually using the DAX capability of axonram or
dcssblk?
2/ Disable code paths that establish potentially long lived DMA
access to a filesystem-dax memory mapping, i.e. RDMA and V4L2. In the
4.16 timeframe the plan is to introduce a "register memory for DMA
with a lease" mechanism for userspace to establish mappings but also
be responsible for tearing down the mapping when the kernel needs to
invalidate the mapping due to truncate or hole-punch.
3/ Add a wakeup mechanism for waiting on DAX pages to be released
from DMA access.
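
As a rough illustration of item 3/, the idea of "wake a waiter once the last DMA reference is dropped" can be modelled in userspace C. This is only a sketch under stated assumptions: `struct model_page`, `model_get_page()`, `model_put_page()`, `model_page_is_idle()` and `model_wake_waiter()` are hypothetical names, not kernel interfaces, and the model assumes a reference count of 1 means "idle".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a DAX page; not the kernel's struct page. */
struct model_page {
	int refcount;                         /* assumption: 1 means idle */
	void (*on_idle)(struct model_page *); /* hypothetical wakeup hook */
	bool woken;                           /* set by the hook so callers can observe it */
};

#define MODEL_IDLE_REFCOUNT 1

/* Take a reference on behalf of a DMA user (e.g. direct I/O). */
static void model_get_page(struct model_page *page)
{
	assert(page != NULL);
	page->refcount++;
}

/* Drop a reference; fire the idle hook on the transition back to idle. */
static void model_put_page(struct model_page *page)
{
	assert(page != NULL && page->refcount > MODEL_IDLE_REFCOUNT);
	if (--page->refcount == MODEL_IDLE_REFCOUNT && page->on_idle)
		page->on_idle(page);
}

/* True when no DMA user holds the page, i.e. truncate could proceed. */
static bool model_page_is_idle(const struct model_page *page)
{
	return page->refcount == MODEL_IDLE_REFCOUNT;
}

/* Example hook: record that a waiter would have been woken. */
static void model_wake_waiter(struct model_page *page)
{
	page->woken = true;
}
```

In this model a truncate path would check `model_page_is_idle()` and otherwise sleep until the hook fires; the real series does the equivalent with wait-bit primitives, whose details are not reproduced here.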
This overall effort started when Christoph noted during the review of
the MAP_DIRECT proposal:
get_user_pages on DAX doesn't give the same guarantees as on
pagecache or anonymous memory, and that is the problem we need to
fix. In fact I'm pretty sure if we try hard enough (and we might
have to try very hard) we can see the same problem with plain direct
I/O and without any RDMA involved, e.g. do a larger direct I/O write
to memory that is mmap()ed from a DAX file, then truncate the DAX
file and reallocate the blocks, and we might corrupt that new file.
We'll probably need a special setup where there is little other
chance but to reallocate those used blocks.
So what we need to do first is to fix get_user_pages vs unmapping
DAX mmap()ed blocks, be that from a hole punch, truncate, COW
operation, etc.
Included in the changes is a nfit_test mechanism to trivially trigger
this collision by delaying the put_page() that the block layer performs
after performing direct-I/O to a filesystem-DAX page.
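
The collision itself can be modelled in a few lines of userspace C. This is a toy sketch, not kernel code and not the nfit_test mechanism: `struct model_pmem`, `struct model_file`, `model_alloc()`, `model_truncate()` and `model_collision()` are hypothetical names. It assumes the essential DAX property that files map storage blocks directly, so there is no page-cache page left behind to absorb a late DMA write.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 8
#define MODEL_NO_BLOCK (-1)

/* Toy "persistent memory": blocks that files map directly, with no page-cache copy. */
struct model_pmem {
	char blocks[2][MODEL_BLOCK_SIZE];
	bool in_use[2];
};

struct model_file {
	int block; /* index of the backing block, or MODEL_NO_BLOCK */
};

/* Give the first free block to a file, zeroing it first. */
static int model_alloc(struct model_pmem *pmem, struct model_file *file)
{
	for (int i = 0; i < 2; i++) {
		if (!pmem->in_use[i]) {
			pmem->in_use[i] = true;
			memset(pmem->blocks[i], 0, MODEL_BLOCK_SIZE);
			file->block = i;
			return i;
		}
	}
	return MODEL_NO_BLOCK;
}

/* Truncate frees the block immediately; nothing here waits for DMA. */
static void model_truncate(struct model_pmem *pmem, struct model_file *file)
{
	if (file->block != MODEL_NO_BLOCK) {
		pmem->in_use[file->block] = false;
		file->block = MODEL_NO_BLOCK;
	}
}

/* Returns true if the delayed DMA write landed in the second file's data. */
static bool model_collision(void)
{
	struct model_pmem pmem = { .in_use = { false, false } };
	struct model_file a = { MODEL_NO_BLOCK }, b = { MODEL_NO_BLOCK };

	model_alloc(&pmem, &a);
	char *dma_target = pmem.blocks[a.block]; /* stands in for a pinned page */

	model_truncate(&pmem, &a); /* file A is truncated while DMA is in flight */
	model_alloc(&pmem, &b);    /* the freed block is reallocated to file B */
	memcpy(pmem.blocks[b.block], "BBBBBBB", MODEL_BLOCK_SIZE);

	memcpy(dma_target, "AAAAAAA", MODEL_BLOCK_SIZE); /* delayed DMA completes */

	return memcmp(pmem.blocks[b.block], "BBBBBBB", MODEL_BLOCK_SIZE) != 0;
}
```

Delaying the final write in the model plays the role that a delayed put_page() plays in the test: it widens the window in which truncate and reallocation can run before the DMA user is finished with the block.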
Given the ongoing coordination of this set across multiple sub-systems
and the dax core, my proposal is to manage this as a branch in the nvdimm
tree with acks from mm, rdma, v4l2, ext4, and xfs.
---
Dan Williams (15):
dax: quiet bdev_dax_supported()
mm, dax: introduce pfn_t_special()
dax: require 'struct page' by default for filesystem dax
brd: remove dax support
dax: stop using VM_MIXEDMAP for dax
dax: stop using VM_HUGEPAGE for dax
dax: stop requiring a live device for dax_flush()
dax: store pfns in the radix
tools/testing/nvdimm: add 'bio_delay' mechanism
IB/core: disable memory registration of fileystem-dax vmas
[media] v4l2: disable filesystem-dax mapping support
mm, dax: enable filesystems to trigger page-idle callbacks
mm, devmap: introduce CONFIG_DEVMAP_MANAGED_PAGES
dax: associate mappings with inodes, and warn if dma collides with truncate
wait_bit: introduce {wait_on,wake_up}_devmap_idle
arch/powerpc/platforms/Kconfig | 1
arch/powerpc/sysdev/axonram.c | 3 -
drivers/block/Kconfig | 12 ---
drivers/block/brd.c | 65 --------------
drivers/dax/device.c | 1
drivers/dax/super.c | 113 +++++++++++++++++++++----
drivers/infiniband/core/umem.c | 49 ++++++++---
drivers/media/v4l2-core/videobuf-dma-sg.c | 39 ++++++++-
drivers/nvdimm/pmem.c | 13 +++
drivers/s390/block/Kconfig | 1
drivers/s390/block/dcssblk.c | 4 +
fs/Kconfig | 8 ++
fs/dax.c | 131 +++++++++++++++++++----------
fs/ext2/file.c | 1
fs/ext2/super.c | 6 +
fs/ext4/file.c | 1
fs/ext4/super.c | 6 +
fs/xfs/xfs_file.c | 2
fs/xfs/xfs_super.c | 20 ++--
include/linux/dax.h | 17 ++--
include/linux/memremap.h | 24 +++++
include/linux/mm.h | 47 ++++++----
include/linux/mm_types.h | 20 +++-
include/linux/pfn_t.h | 13 +++
include/linux/vma.h | 33 +++++++
include/linux/wait_bit.h | 10 ++
kernel/memremap.c | 36 ++++++--
kernel/sched/wait_bit.c | 64 ++++++++++++--
mm/Kconfig | 5 +
mm/hmm.c | 13 ---
mm/huge_memory.c | 8 +-
mm/ksm.c | 3 +
mm/madvise.c | 2
mm/memory.c | 22 ++++-
mm/migrate.c | 3 -
mm/mlock.c | 5 +
mm/mmap.c | 8 +-
mm/swap.c | 3 -
tools/testing/nvdimm/Kbuild | 1
tools/testing/nvdimm/test/iomap.c | 62 ++++++++++++++
tools/testing/nvdimm/test/nfit.c | 34 ++++++++
tools/testing/nvdimm/test/nfit_test.h | 1
42 files changed, 650 insertions(+), 260 deletions(-)
create mode 100644 include/linux/vma.h
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas
2017-10-31 23:21 [PATCH 00/15] dax: prep work for fixing dax-dma vs truncate collisions Dan Williams
@ 2017-10-31 23:22 ` Dan Williams
2017-11-02 20:13 ` Christoph Hellwig
2017-10-31 23:22 ` [PATCH 11/15] [media] v4l2: disable filesystem-dax mapping support Dan Williams
1 sibling, 1 reply; 5+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
To: linux-nvdimm
Cc: Sean Hefty, linux-xfs, akpm, linux-rdma, linux-kernel, Jeff Moyer,
stable, hch, Jason Gunthorpe, linux-mm, Doug Ledford,
linux-fsdevel, Ross Zwisler, Hal Rosenstock
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow RDMA to create long standing memory registrations
against filesystem-dax vmas. Device-dax vmas do not have this problem and
are explicitly allowed.
This is temporary until a "memory registration with layout-lease"
mechanism can be implemented, and is limited to non-ODP (On Demand
Paging) capable RDMA devices.
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: <linux-rdma@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
drivers/infiniband/core/umem.c | 49 +++++++++++++++++++++++++++++++---------
1 file changed, 38 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..c30d286c1f24 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -147,19 +147,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->hugetlb = 1;
page_list = (struct page **) __get_free_page(GFP_KERNEL);
- if (!page_list) {
- put_pid(umem->pid);
- kfree(umem);
- return ERR_PTR(-ENOMEM);
- }
+ if (!page_list)
+ goto err_pagelist;
/*
- * if we can't alloc the vma_list, it's not so bad;
- * just assume the memory is not hugetlb memory
+ * If DAX is enabled we need the vma to protect against
+ * registering filesystem-dax memory. Otherwise we can tolerate
+ * a failure to allocate the vma_list and just assume that all
+ * vmas are not hugetlb-vmas.
*/
vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
- if (!vma_list)
+ if (!vma_list) {
+ if (IS_ENABLED(CONFIG_FS_DAX))
+ goto err_vmalist;
umem->hugetlb = 0;
+ }
npages = ib_umem_num_pages(umem);
@@ -199,15 +201,34 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
if (ret < 0)
goto out;
- umem->npages += ret;
cur_base += ret * PAGE_SIZE;
npages -= ret;
for_each_sg(sg_list_start, sg, ret, i) {
- if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
- umem->hugetlb = 0;
+ struct vm_area_struct *vma;
+ struct inode *inode;
sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+ umem->npages++;
+
+ if (!vma_list)
+ continue;
+ vma = vma_list[i];
+
+ if (!is_vm_hugetlb_page(vma))
+ umem->hugetlb = 0;
+
+ if (!vma_is_dax(vma))
+ continue;
+
+ /* device-dax is safe for rdma... */
+ inode = file_inode(vma->vm_file);
+ if (S_ISCHR(inode->i_mode))
+ continue;
+
+ /* ...filesystem-dax is not. */
+ ret = -EOPNOTSUPP;
+ goto out;
}
/* preparing for next loop */
@@ -242,6 +263,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
free_page((unsigned long) page_list);
return ret < 0 ? ERR_PTR(ret) : umem;
+err_vmalist:
+ free_page((unsigned long) page_list);
+err_pagelist:
+ put_pid(umem->pid);
+ kfree(umem);
+ return ERR_PTR(-ENOMEM);
}
EXPORT_SYMBOL(ib_umem_get);
* [PATCH 11/15] [media] v4l2: disable filesystem-dax mapping support
2017-10-31 23:21 [PATCH 00/15] dax: prep work for fixing dax-dma vs truncate collisions Dan Williams
2017-10-31 23:22 ` [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas Dan Williams
@ 2017-10-31 23:22 ` Dan Williams
1 sibling, 0 replies; 5+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
To: linux-nvdimm
Cc: Jan Kara, linux-kernel, stable, linux-xfs, linux-mm,
linux-fsdevel, akpm, Mauro Carvalho Chehab, hch, linux-media
V4L2 memory registrations are incompatible with filesystem-dax that
needs the ability to revoke dma access to a mapping at will, or
otherwise allow the kernel to wait for completion of DMA. The
filesystem-dax implementation breaks the traditional solution of
truncate of active file backed mappings since there is no page-cache
page we can orphan to sustain ongoing DMA.
If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.
Reported-by: Jan Kara <jack@suse.cz>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-media@vger.kernel.org
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
drivers/media/v4l2-core/videobuf-dma-sg.c | 39 ++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 2 deletions(-)
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 0b5c43f7e020..37a4ae61b2c0 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -155,8 +155,9 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
int direction, unsigned long data, unsigned long size)
{
unsigned long first, last;
- int err, rw = 0;
+ int err, rw = 0, i, nr_pages;
unsigned int flags = FOLL_FORCE;
+ struct vm_area_struct **vmas = NULL;
dma->direction = direction;
switch (dma->direction) {
@@ -179,6 +180,16 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
if (NULL == dma->pages)
return -ENOMEM;
+ if (IS_ENABLED(CONFIG_FS_DAX)) {
+ vmas = kmalloc(dma->nr_pages * sizeof(struct vm_area_struct *),
+ GFP_KERNEL);
+ if (NULL == vmas) {
+ kfree(dma->pages);
+ dma->pages = NULL;
+ return -ENOMEM;
+ }
+ }
+
if (rw == READ)
flags |= FOLL_WRITE;
@@ -186,7 +197,31 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
data, size, dma->nr_pages);
err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
- flags, dma->pages, NULL);
+ flags, dma->pages, vmas);
+ nr_pages = err;
+
+ for (i = 0; vmas && i < nr_pages; i++) {
+ struct vm_area_struct *vma = vmas[i];
+ struct inode *inode;
+
+ if (!vma_is_dax(vma))
+ continue;
+
+ /* device-dax is safe for long-lived v4l2 mappings... */
+ inode = file_inode(vma->vm_file);
+ if (S_ISCHR(inode->i_mode))
+ continue;
+
+ /* ...filesystem-dax is not. */
+ err = -EOPNOTSUPP;
+ break;
+
+ /*
+ * FIXME: add a 'with lease' mechanism for v4l2 to
+ * obtain time bounded access to filesystem-dax mappings
+ */
+ }
+ kfree(vmas);
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
* Re: [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas
2017-10-31 23:22 ` [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas Dan Williams
@ 2017-11-02 20:13 ` Christoph Hellwig
2017-11-02 21:06 ` Dan Williams
0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2017-11-02 20:13 UTC (permalink / raw)
To: Dan Williams
Cc: linux-nvdimm, Sean Hefty, linux-xfs, akpm, linux-rdma,
linux-kernel, Jeff Moyer, stable, hch, Jason Gunthorpe, linux-mm,
Doug Ledford, linux-fsdevel, Ross Zwisler, Hal Rosenstock
Any chance we could add a new get_user_pages_longerm or similar
helper instead of opencoding this in the various callers?
* Re: [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas
2017-11-02 20:13 ` Christoph Hellwig
@ 2017-11-02 21:06 ` Dan Williams
0 siblings, 0 replies; 5+ messages in thread
From: Dan Williams @ 2017-11-02 21:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvdimm@lists.01.org, Sean Hefty, linux-xfs, Andrew Morton,
linux-rdma, linux-kernel@vger.kernel.org, Jeff Moyer,
stable@vger.kernel.org, Jason Gunthorpe, Linux MM, Doug Ledford,
linux-fsdevel, Ross Zwisler, Hal Rosenstock
On Thu, Nov 2, 2017 at 1:13 PM, Christoph Hellwig <hch@lst.de> wrote:
> Any chance we could add a new get_user_pages_longerm or similar
> helper instead of opencoding this in the various callers?
Sounds like a great idea to me, I'll take a look...