* [PATCH 1/2] dax: prevent invalidation of mapped DAX entries
       [not found] <20170420191446.GA21694@linux.intel.com>
@ 2017-04-21  3:44 ` Ross Zwisler
  2017-04-21  3:44   ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Ross Zwisler @ 2017-04-21  3:44 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Latchesar Ionkov, Jan Kara, Trond Myklebust, linux-mm,
	Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin,
	Eric Van Hensbergen, linux-nvdimm, Alexander Viro, v9fs-developer,
	Jens Axboe, linux-nfs, Darrick J. Wong, samba-technical,
	Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel,
	Ron Minnich, Anna Schumaker

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked. This is done via:

invalidate_mapping_pages()
  invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages() there
is an additional criterion, which is that the page must not be mapped. This
is noted in the comments above invalidate_mapping_pages() and is checked in
invalidate_inode_page().

For DAX entries this means that we can end up in a situation where a DAX
exceptional entry, either a huge zero page or a regular DAX entry, could
end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to its
comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
---

This series applies cleanly to the current v4.11-rc7 based linux/master,
and has passed an xfstests run with DAX on ext4 and XFS.

These patches also apply to v4.10.9 with a little work from the 3-way merge
feature.

 fs/dax.c            | 29 -----------------------------
 include/linux/dax.h |  1 -
 mm/truncate.c       |  9 +++------
 3 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 85abd74..166504c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -507,35 +507,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /*
- * Invalidate exceptional DAX entry if easily possible. This handles DAX
- * entries for invalidate_inode_pages() so we evict the entry only if we can
- * do so without blocking.
- */
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret = 0;
-	void *entry, **slot;
-	struct radix_tree_root *page_tree = &mapping->page_tree;
-
-	spin_lock_irq(&mapping->tree_lock);
-	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
-	if (!entry || !radix_tree_exceptional_entry(entry) ||
-	    slot_locked(mapping, slot))
-		goto out;
-	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
-	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-		goto out;
-	radix_tree_delete(page_tree, index);
-	mapping->nrexceptional--;
-	ret = 1;
-out:
-	spin_unlock_irq(&mapping->tree_lock);
-	if (ret)
-		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
-	return ret;
-}
-
-/*
  * Invalidate exceptional DAX entry if it is clean.
  */
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d8a3dc0..f8e1833 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,7 +41,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    const struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
diff --git a/mm/truncate.c b/mm/truncate.c
index 6263aff..c537184 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
 
 /*
  * Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
- * clean entries.
+ * entries for invalidate_inode_pages().
  */
 static int invalidate_exceptional_entry(struct address_space *mapping,
 					pgoff_t index, void *entry)
 {
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
+	/* Handled by shmem itself, or for DAX we do nothing. */
+	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return 1;
-	if (dax_mapping(mapping))
-		return dax_invalidate_mapping_entry(mapping, index);
 	clear_shadow_entry(mapping, index, entry);
 	return 1;
 }
-- 
2.9.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-21  3:44 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
@ 2017-04-21  3:44   ` Ross Zwisler
  2017-04-25 11:10     ` Jan Kara
  2017-04-24 17:49   ` [PATCH 1/2] xfs: fix incorrect argument count check Ross Zwisler
  2017-04-25 10:10   ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
  2 siblings, 1 reply; 17+ messages in thread
From: Ross Zwisler @ 2017-04-21  3:44 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Latchesar Ionkov, Jan Kara, Trond Myklebust, linux-mm,
	Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin,
	Eric Van Hensbergen, linux-nvdimm, Alexander Viro, v9fs-developer,
	Jens Axboe, linux-nfs, Darrick J. Wong, samba-technical,
	Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel,
	Ron Minnich, Anna Schumaker

Users of DAX can suffer data corruption from stale mmap reads via the
following sequence:

- open an mmap over a 2MiB hole

- read from a 2MiB hole, faulting in a 2MiB zero page

- write to the hole with write(3p). The write succeeds but we incorrectly
  leave the 2MiB zero page mapping intact.

- via the mmap, read the data that was just written. Since the zero page
  mapping is still intact we read back zeroes instead of the new data.

We fix this by unconditionally calling invalidate_inode_pages2_range() in
dax_iomap_actor() for new block allocations, and by enhancing
__dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
being removed from the radix tree.

This is based on an initial patch from Jan Kara.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
---
 fs/dax.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 166504c..3f445d5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
 	int ret = 0;
-	void *entry;
+	void *entry, **slot;
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	if (!entry || !radix_tree_exceptional_entry(entry))
 		goto out;
 	if (!trunc &&
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+
+	/*
+	 * Make sure 'entry' remains valid while we drop mapping->tree_lock to
+	 * do the unmap_mapping_range() call.
+	 */
+	entry = lock_slot(mapping, slot);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
+			(loff_t)PAGE_SIZE << dax_radix_order(entry), 0);
+
+	spin_lock_irq(&mapping->tree_lock);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
 out:
-	put_unlocked_mapping_entry(mapping, index, entry);
 	spin_unlock_irq(&mapping->tree_lock);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
 	return ret;
 }
 /*
@@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		return -EIO;
 
 	/*
-	 * Write can allocate block for an area which has a hole page mapped
-	 * into page tables. We have to tear down these mappings so that data
-	 * written by write(2) is visible in mmap.
+	 * Write can allocate block for an area which has a hole page or zero
+	 * PMD entry in the radix tree. We have to tear down these mappings so
+	 * that data written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if (iomap->flags & IOMAP_F_NEW) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
-- 
2.9.3

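The four-step sequence in the changelog can be exercised from userspace. What follows is only a minimal sketch of such a check, not the test that was actually run for this series: the helper name check_mmap_coherency() and the path handed to it are made up for illustration, and whether the read fault really installs a 2MiB zero page depends on the filesystem, its mount options and the alignment mmap() happens to pick.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define HOLE_SIZE (2UL * 1024 * 1024)	/* assumption: 2MiB, the x86-64 PMD size */

/*
 * Hypothetical helper: returns 0 if a read through the mapping sees the
 * data written with pwrite(), 1 if it sees stale data, -1 on setup errors.
 */
int check_mmap_coherency(const char *path)
{
	const char msg[] = "new data";
	volatile char sink;
	char *map;
	int fd, ret = -1;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* A freshly extended file is one big hole. */
	if (ftruncate(fd, HOLE_SIZE) < 0)
		goto out_close;

	map = mmap(NULL, HOLE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Read fault over the hole; with DAX this may map a zero page. */
	sink = map[0];
	(void)sink;

	/* Allocate blocks behind the mapping with a regular write. */
	if (pwrite(fd, msg, sizeof(msg), 0) != (ssize_t)sizeof(msg))
		goto out_unmap;

	/* A coherent mapping must now show the new data, not zeroes. */
	ret = memcmp(map, msg, sizeof(msg)) ? 1 : 0;

out_unmap:
	munmap(map, HOLE_SIZE);
out_close:
	close(fd);
	return ret;
}
```

Calling check_mmap_coherency() from a small main() with a path on the filesystem under test should give 0 on any coherent filesystem; a return value of 1 would correspond to the stale read described above.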
* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads 2017-04-21 3:44 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler @ 2017-04-25 11:10 ` Jan Kara 2017-04-25 22:59 ` Ross Zwisler 0 siblings, 1 reply; 17+ messages in thread From: Jan Kara @ 2017-04-25 11:10 UTC (permalink / raw) To: Ross Zwisler Cc: Latchesar Ionkov, Jan Kara, Trond Myklebust, linux-mm, Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin, Eric Van Hensbergen, linux-nvdimm, Alexander Viro, v9fs-developer, Jens Axboe, linux-nfs, Darrick J. Wong, samba-technical, linux-kernel, Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel, Ron Minnich, Andrew Morton, Anna Schumaker On Thu 20-04-17 21:44:37, Ross Zwisler wrote: > Users of DAX can suffer data corruption from stale mmap reads via the > following sequence: > > - open an mmap over a 2MiB hole > > - read from a 2MiB hole, faulting in a 2MiB zero page > > - write to the hole with write(3p). The write succeeds but we incorrectly > leave the 2MiB zero page mapping intact. > > - via the mmap, read the data that was just written. Since the zero page > mapping is still intact we read back zeroes instead of the new data. > > We fix this by unconditionally calling invalidate_inode_pages2_range() in > dax_iomap_actor() for new block allocations, and by enhancing > __dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry > being removed from the radix tree. > > This is based on an initial patch from Jan Kara. 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate") > Reported-by: Jan Kara <jack@suse.cz> > Cc: <stable@vger.kernel.org> [4.10+] > --- > fs/dax.c | 26 +++++++++++++++++++------- > 1 file changed, 19 insertions(+), 7 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 166504c..3f445d5 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping, > pgoff_t index, bool trunc) > { > int ret = 0; > - void *entry; > + void *entry, **slot; > struct radix_tree_root *page_tree = &mapping->page_tree; > > spin_lock_irq(&mapping->tree_lock); > - entry = get_unlocked_mapping_entry(mapping, index, NULL); > + entry = get_unlocked_mapping_entry(mapping, index, &slot); > if (!entry || !radix_tree_exceptional_entry(entry)) > goto out; > if (!trunc && > (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) || > radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))) > goto out; > + > + /* > + * Make sure 'entry' remains valid while we drop mapping->tree_lock to > + * do the unmap_mapping_range() call. > + */ > + entry = lock_slot(mapping, slot); This also stops page faults from mapping the entry again. Maybe worth mentioning here as well. > + spin_unlock_irq(&mapping->tree_lock); > + > + unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT, > + (loff_t)PAGE_SIZE << dax_radix_order(entry), 0); Ouch, unmapping entry-by-entry may get quite expensive if you are unmapping large ranges - each unmap means an rmap walk... Since this is a data corruption class of bug, let's fix it this way for now but I think we'll need to improve this later. E.g. what if we called unmap_mapping_range() for the whole invalidated range after removing the radix tree entries? 
Hum, but now thinking more about it I have hard time figuring out why write
vs fault cannot actually still race:

CPU1 - write(2)                         CPU2 - read fault

                                        dax_iomap_pte_fault()
                                          ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
                                          grab_mapping_entry()
                                          - we add zero page in the radix
                                            tree & map it to page tables

Similarly read vs write fault may end up racing in a wrong way and try to
replace already existing exceptional entry with a hole page?

								Honza
> +
> +	spin_lock_irq(&mapping->tree_lock);
>  	radix_tree_delete(page_tree, index);
>  	mapping->nrexceptional--;
>  	ret = 1;
>  out:
> -	put_unlocked_mapping_entry(mapping, index, entry);
>  	spin_unlock_irq(&mapping->tree_lock);
> +	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
>  	return ret;
>  }
>  /*
> @@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		return -EIO;
>  
>  	/*
> -	 * Write can allocate block for an area which has a hole page mapped
> -	 * into page tables. We have to tear down these mappings so that data
> -	 * written by write(2) is visible in mmap.
> +	 * Write can allocate block for an area which has a hole page or zero
> +	 * PMD entry in the radix tree. We have to tear down these mappings so
> +	 * that data written by write(2) is visible in mmap.
>  	 */
> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +	if (iomap->flags & IOMAP_F_NEW) {
>  		invalidate_inode_pages2_range(inode->i_mapping,
>  					      pos >> PAGE_SHIFT,
>  					      (end - 1) >> PAGE_SHIFT);
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

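To put rough numbers on the cost concern raised in this review (one unmap_mapping_range() call, and so one rmap walk, per radix tree entry), the index arithmetic can be sketched in plain C. This is an illustration only: it assumes 4KiB pages and order-9 (2MiB) PMD entries as on x86-64, and the TOY_* constants and helper names are hypothetical, not kernel symbols.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4KiB pages */
#define TOY_PMD_ORDER   9	/* assumption: 2MiB PMD entries (x86-64) */

/*
 * Number of radix tree entries of the given order touched by the byte
 * range [pos, pos + len), len > 0. For order 0 this mirrors the
 * "pos >> PAGE_SHIFT" and "(end - 1) >> PAGE_SHIFT" bounds that
 * dax_iomap_actor() passes to invalidate_inode_pages2_range().
 */
uint64_t toy_entries_in_range(uint64_t pos, uint64_t len, unsigned int order)
{
	unsigned int shift = TOY_PAGE_SHIFT + order;
	uint64_t first = pos >> shift;
	uint64_t last = (pos + len - 1) >> shift;

	return last - first + 1;
}

/*
 * Bytes covered by a single entry, i.e. the length argument
 * "PAGE_SIZE << dax_radix_order(entry)" in the patch.
 */
uint64_t toy_entry_bytes(unsigned int order)
{
	return (uint64_t)1 << (TOY_PAGE_SHIFT + order);
}
```

Under these assumptions a 1GiB write over PTE-sized entries touches 262144 entries and over PMD-sized entries 512, so unmapping entry by entry means that many separate calls, whereas unmapping the whole invalidated range afterwards would need only one.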
* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-25 11:10     ` Jan Kara
@ 2017-04-25 22:59       ` Ross Zwisler
  2017-04-26  8:52         ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Zwisler @ 2017-04-25 22:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Latchesar Ionkov, Trond Myklebust, linux-mm, Christoph Hellwig,
	linux-cifs, Matthew Wilcox, Andrey Ryabinin, Eric Van Hensbergen,
	linux-nvdimm, Alexander Viro, v9fs-developer, Jens Axboe,
	linux-nfs, Darrick J. Wong, samba-technical, linux-kernel,
	Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel,
	Ron Minnich, Andrew Morton, Anna Schumaker

On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
<>
> Hum, but now thinking more about it I have hard time figuring out why write
> vs fault cannot actually still race:
> 
> CPU1 - write(2)                         CPU2 - read fault
> 
>                                         dax_iomap_pte_fault()
>                                           ->iomap_begin() - sees hole
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
>                                           grab_mapping_entry()
>                                           - we add zero page in the radix
>                                             tree & map it to page tables
> 
> Similarly read vs write fault may end up racing in a wrong way and try to
> replace already existing exceptional entry with a hole page?

Yep, this race seems real to me, too. This seems very much like the issues
that exist when a thread is doing direct I/O. One thread is doing I/O to an
intermediate buffer (page cache for direct I/O case, zero page for us), and
the other is going around it directly to media, and they can get out of sync.

IIRC the direct I/O code looked something like:

1/ invalidate existing mappings
2/ do direct I/O to media
3/ invalidate mappings again, just in case. Should be cheap if there weren't
   any conflicting faults. This makes sure any new allocations we made are
   faulted in.

I guess one option would be to replicate that logic in the DAX I/O path, or we
could try and enhance our locking so page faults can't race with I/O since
both can allocate blocks.

I'm not sure, but will think on it.

* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads 2017-04-25 22:59 ` Ross Zwisler @ 2017-04-26 8:52 ` Jan Kara 2017-04-26 22:52 ` Ross Zwisler 0 siblings, 1 reply; 17+ messages in thread From: Jan Kara @ 2017-04-26 8:52 UTC (permalink / raw) To: Ross Zwisler Cc: Latchesar Ionkov, Jan Kara, Trond Myklebust, linux-mm, Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin, Eric Van Hensbergen, linux-nvdimm, Alexander Viro, v9fs-developer, Jens Axboe, linux-nfs, Darrick J. Wong, samba-technical, linux-kernel, Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel, Ron Minnich, Andrew Morton, Anna Schumaker On Tue 25-04-17 16:59:36, Ross Zwisler wrote: > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote: > <> > > Hum, but now thinking more about it I have hard time figuring out why write > > vs fault cannot actually still race: > > > > CPU1 - write(2) CPU2 - read fault > > > > dax_iomap_pte_fault() > > ->iomap_begin() - sees hole > > dax_iomap_rw() > > iomap_apply() > > ->iomap_begin - allocates blocks > > dax_iomap_actor() > > invalidate_inode_pages2_range() > > - there's nothing to invalidate > > grab_mapping_entry() > > - we add zero page in the radix > > tree & map it to page tables > > > > Similarly read vs write fault may end up racing in a wrong way and try to > > replace already existing exceptional entry with a hole page? > > Yep, this race seems real to me, too. This seems very much like the issues > that exist when a thread is doing direct I/O. One thread is doing I/O to an > intermediate buffer (page cache for direct I/O case, zero page for us), and > the other is going around it directly to media, and they can get out of sync. > > IIRC the direct I/O code looked something like: > > 1/ invalidate existing mappings > 2/ do direct I/O to media > 3/ invalidate mappings again, just in case. Should be cheap if there weren't > any conflicting faults. This makes sure any new allocations we made are > faulted in. 
Yeah, the problem is people generally expect weird behavior when they mix direct and buffered IO (or let alone mmap) however everyone expects standard read(2) and write(2) to be completely coherent with mmap(2). > I guess one option would be to replicate that logic in the DAX I/O path, or we > could try and enhance our locking so page faults can't race with I/O since > both can allocate blocks. In the abstract way, the problem is that we have radix tree (and page tables) cache block mapping information and the operation: "read block mapping information, store it in the radix tree" is not serialized in any way against other block allocations so the information we store can be out of date by the time we store it. One way to solve this would be to move ->iomap_begin call in the fault paths under entry lock although that would mean I have to redo how ext4 handles DAX faults because with current code it would create lock inversion wrt transaction start. Another solution would be to grab i_mmap_sem for write when doing write fault of a page and similarly have it grabbed for writing when doing write(2). This would scale rather poorly but if we later replaced it with a range lock (Davidlohr has already posted a nice implementation of it) it won't be as bad. But I guess option 1) is better... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads 2017-04-26 8:52 ` Jan Kara @ 2017-04-26 22:52 ` Ross Zwisler 2017-04-27 7:26 ` Jan Kara 0 siblings, 1 reply; 17+ messages in thread From: Ross Zwisler @ 2017-04-26 22:52 UTC (permalink / raw) To: Jan Kara Cc: Latchesar Ionkov, Trond Myklebust, linux-mm, Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin, Eric Van Hensbergen, linux-nvdimm, Alexander Viro, v9fs-developer, Jens Axboe, linux-nfs, Darrick J. Wong, samba-technical, linux-kernel, Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel, Ron Minnich, Andrew Morton, Anna Schumaker On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote: > On Tue 25-04-17 16:59:36, Ross Zwisler wrote: > > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote: > > <> > > > Hum, but now thinking more about it I have hard time figuring out why write > > > vs fault cannot actually still race: > > > > > > CPU1 - write(2) CPU2 - read fault > > > > > > dax_iomap_pte_fault() > > > ->iomap_begin() - sees hole > > > dax_iomap_rw() > > > iomap_apply() > > > ->iomap_begin - allocates blocks > > > dax_iomap_actor() > > > invalidate_inode_pages2_range() > > > - there's nothing to invalidate > > > grab_mapping_entry() > > > - we add zero page in the radix > > > tree & map it to page tables > > > > > > Similarly read vs write fault may end up racing in a wrong way and try to > > > replace already existing exceptional entry with a hole page? > > > > Yep, this race seems real to me, too. This seems very much like the issues > > that exist when a thread is doing direct I/O. One thread is doing I/O to an > > intermediate buffer (page cache for direct I/O case, zero page for us), and > > the other is going around it directly to media, and they can get out of sync. > > > > IIRC the direct I/O code looked something like: > > > > 1/ invalidate existing mappings > > 2/ do direct I/O to media > > 3/ invalidate mappings again, just in case. 
Should be cheap if there weren't
> >    any conflicting faults. This makes sure any new allocations we made are
> >    faulted in.
> 
> Yeah, the problem is people generally expect weird behavior when they mix
> direct and buffered IO (or let alone mmap) however everyone expects
> standard read(2) and write(2) to be completely coherent with mmap(2).

Yep, fair enough.

> > I guess one option would be to replicate that logic in the DAX I/O path, or we
> > could try and enhance our locking so page faults can't race with I/O since
> > both can allocate blocks.
> 
> In the abstract way, the problem is that we have radix tree (and page
> tables) cache block mapping information and the operation: "read block
> mapping information, store it in the radix tree" is not serialized in any
> way against other block allocations so the information we store can be out
> of date by the time we store it.
> 
> One way to solve this would be to move ->iomap_begin call in the fault
> paths under entry lock although that would mean I have to redo how ext4
> handles DAX faults because with current code it would create lock inversion
> wrt transaction start.

I don't think this alone is enough to save us. The I/O path doesn't currently
take any DAX radix tree entry locks, so our race would just become:

CPU1 - write(2)                         CPU2 - read fault

                                        dax_iomap_pte_fault()
                                          grab_mapping_entry() // newly moved
                                          ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
                                          - we add zero page in the radix
                                            tree & map it to page tables

In their current form I don't think we want to take DAX radix tree entry locks
in the I/O path because that would effectively serialize I/O over a given
radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
would be serialized.

> Another solution would be to grab i_mmap_sem for write when doing write > fault of a page and similarly have it grabbed for writing when doing > write(2). This would scale rather poorly but if we later replaced it with a > range lock (Davidlohr has already posted a nice implementation of it) it > won't be as bad. But I guess option 1) is better... The best idea I had for handling this sounds similar, which would be to convert the radix tree locks to essentially be reader/writer locks. I/O and faults that don't modify the block mapping could just take read-level locks, and could all run concurrently. I/O or faults that modify a block mapping would take a write lock, and serialize with other writers and readers. You could know if you needed a write lock without asking the filesystem - if you're a write and the radix tree entry is empty or is for a zero page, you grab the write lock. This dovetails nicely with the idea of having the radix tree act as a cache for block mappings. You take the appropriate lock on the radix tree entry, and it has the block mapping info for your I/O or fault so you don't have to call into the FS. I/O would also participate so we would keep info about block mappings that we gather from I/O to help shortcut our page faults. How does this sound vs the range lock idea? How hard do you think it would be to convert our current wait queue system to reader/writer style locking? Also, how do you think we should deal with the current PMD corruption? Should we go with the current fix (I can augment the comments as you suggested), and then handle optimizations to that approach and the solution to this larger race as a follow-on? _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads 2017-04-26 22:52 ` Ross Zwisler @ 2017-04-27 7:26 ` Jan Kara 2017-05-01 22:38 ` Ross Zwisler 2017-05-01 22:59 ` Dan Williams 0 siblings, 2 replies; 17+ messages in thread From: Jan Kara @ 2017-04-27 7:26 UTC (permalink / raw) To: Ross Zwisler Cc: Latchesar Ionkov, Jan Kara, Trond Myklebust, linux-mm, Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin, Eric Van Hensbergen, linux-nvdimm, Alexander Viro, v9fs-developer, Jens Axboe, linux-nfs, Darrick J. Wong, samba-technical, linux-kernel, Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel, Ron Minnich, Andrew Morton, Anna Schumaker On Wed 26-04-17 16:52:36, Ross Zwisler wrote: > On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote: > > On Tue 25-04-17 16:59:36, Ross Zwisler wrote: > > > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote: > > > <> > > > > Hum, but now thinking more about it I have hard time figuring out why write > > > > vs fault cannot actually still race: > > > > > > > > CPU1 - write(2) CPU2 - read fault > > > > > > > > dax_iomap_pte_fault() > > > > ->iomap_begin() - sees hole > > > > dax_iomap_rw() > > > > iomap_apply() > > > > ->iomap_begin - allocates blocks > > > > dax_iomap_actor() > > > > invalidate_inode_pages2_range() > > > > - there's nothing to invalidate > > > > grab_mapping_entry() > > > > - we add zero page in the radix > > > > tree & map it to page tables > > > > > > > > Similarly read vs write fault may end up racing in a wrong way and try to > > > > replace already existing exceptional entry with a hole page? > > > > > > Yep, this race seems real to me, too. This seems very much like the issues > > > that exist when a thread is doing direct I/O. One thread is doing I/O to an > > > intermediate buffer (page cache for direct I/O case, zero page for us), and > > > the other is going around it directly to media, and they can get out of sync. 
> > > > > > IIRC the direct I/O code looked something like: > > > > > > 1/ invalidate existing mappings > > > 2/ do direct I/O to media > > > 3/ invalidate mappings again, just in case. Should be cheap if there weren't > > > any conflicting faults. This makes sure any new allocations we made are > > > faulted in. > > > > Yeah, the problem is people generally expect weird behavior when they mix > > direct and buffered IO (or let alone mmap) however everyone expects > > standard read(2) and write(2) to be completely coherent with mmap(2). > > Yep, fair enough. > > > > I guess one option would be to replicate that logic in the DAX I/O path, or we > > > could try and enhance our locking so page faults can't race with I/O since > > > both can allocate blocks. > > > > In the abstract way, the problem is that we have radix tree (and page > > tables) cache block mapping information and the operation: "read block > > mapping information, store it in the radix tree" is not serialized in any > > way against other block allocations so the information we store can be out > > of date by the time we store it. > > > > One way to solve this would be to move ->iomap_begin call in the fault > > paths under entry lock although that would mean I have to redo how ext4 > > handles DAX faults because with current code it would create lock inversion > > wrt transaction start. > > I don't think this alone is enough to save us. 
The I/O path doesn't currently > take any DAX radix tree entry locks, so our race would just become: > > CPU1 - write(2) CPU2 - read fault > > dax_iomap_pte_fault() > grab_mapping_entry() // newly moved > ->iomap_begin() - sees hole > dax_iomap_rw() > iomap_apply() > ->iomap_begin - allocates blocks > dax_iomap_actor() > invalidate_inode_pages2_range() > - there's nothing to invalidate > - we add zero page in the radix > tree & map it to page tables > > In their current form I don't think we want to take DAX radix tree entry locks > in the I/O path because that would effectively serialize I/O over a given > radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range > would be serialized. Note that invalidate_inode_pages2_range() will see the entry created by grab_mapping_entry() on CPU2 and block waiting for its lock and this is exactly what stops the race. The invalidate_inode_pages2_range() effectively makes sure there isn't any page fault in progress for given range... Also note that writes to a file are serialized by i_rwsem anyway (and at least serialization of writes to the overlapping range is required by POSIX) so this doesn't add any more serialization than we already have. > > Another solution would be to grab i_mmap_sem for write when doing write > > fault of a page and similarly have it grabbed for writing when doing > > write(2). This would scale rather poorly but if we later replaced it with a > > range lock (Davidlohr has already posted a nice implementation of it) it > > won't be as bad. But I guess option 1) is better... > > The best idea I had for handling this sounds similar, which would be to > convert the radix tree locks to essentially be reader/writer locks. I/O and > faults that don't modify the block mapping could just take read-level locks, > and could all run concurrently. I/O or faults that modify a block mapping > would take a write lock, and serialize with other writers and readers. 
Well, this would be difficult to implement inside the radix tree (not enough bits in the entry) so you'd have to go for some external locking primitive anyway. And if you do that, read-write range lock Davidlohr has implemented is what you describe - well we could also have a radix tree with rwsems but I suspect the overhead of maintaining that would be too large. It would require larger rewrite than reusing entry locks as I suggest above though and it isn't an obvious performance win for realistic workloads either so I'd like to see some performance numbers before going that way. It likely improves a situation where processes race to fault the same page for which we already know the block mapping but I'm not sure if that translates to any measurable performance wins for workloads on DAX filesystem. > You could know if you needed a write lock without asking the filesystem - if > you're a write and the radix tree entry is empty or is for a zero page, you > grab the write lock. > > This dovetails nicely with the idea of having the radix tree act as a cache > for block mappings. You take the appropriate lock on the radix tree entry, > and it has the block mapping info for your I/O or fault so you don't have to > call into the FS. I/O would also participate so we would keep info about > block mappings that we gather from I/O to help shortcut our page faults. > > How does this sound vs the range lock idea? How hard do you think it would be > to convert our current wait queue system to reader/writer style locking? > > Also, how do you think we should deal with the current PMD corruption? Should > we go with the current fix (I can augment the comments as you suggested), and > then handle optimizations to that approach and the solution to this larger > race as a follow-on? So for now I'm still more inclined to just stay with the radix tree lock as is and just fix up the locking as I suggest and go for larger rewrite only if we can demonstrate further performance wins. 
WRT your second patch, if we go with the locking as I suggest, it is enough to unmap the whole range after invalidate_inode_pages2() has cleared radix tree entries (*) which will be much cheaper (for large writes) than doing unmapping entry by entry. So I'd go for that. I'll prepare a patch for the locking change - it will require changes to ext4 transaction handling so it won't be completely trivial. (*) The flow of information is: filesystem block mapping info -> radix tree -> page tables so if 'filesystem block mapping info' changes, we should go invalidate corresponding radix tree entries (new entries will already have uptodate info) and then invalidate corresponding page tables (again once radix tree has no stale entries, we are sure new page table entries will be uptodate). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads 2017-04-27 7:26 ` Jan Kara @ 2017-05-01 22:38 ` Ross Zwisler 2017-05-04 9:12 ` Jan Kara 2017-05-01 22:59 ` Dan Williams 1 sibling, 1 reply; 17+ messages in thread From: Ross Zwisler @ 2017-05-01 22:38 UTC (permalink / raw) To: Jan Kara Cc: Latchesar Ionkov, Trond Myklebust, linux-mm, Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin, Eric Van Hensbergen, linux-nvdimm, Alexander Viro, v9fs-developer, Jens Axboe, linux-nfs, Darrick J. Wong, samba-technical, linux-kernel, Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel, Ron Minnich, Andrew Morton, Anna Schumaker On Thu, Apr 27, 2017 at 09:26:59AM +0200, Jan Kara wrote: > On Wed 26-04-17 16:52:36, Ross Zwisler wrote: <> > > I don't think this alone is enough to save us. The I/O path doesn't currently > > take any DAX radix tree entry locks, so our race would just become: > > > > CPU1 - write(2) CPU2 - read fault > > > > dax_iomap_pte_fault() > > grab_mapping_entry() // newly moved > > ->iomap_begin() - sees hole > > dax_iomap_rw() > > iomap_apply() > > ->iomap_begin - allocates blocks > > dax_iomap_actor() > > invalidate_inode_pages2_range() > > - there's nothing to invalidate > > - we add zero page in the radix > > tree & map it to page tables > > > > In their current form I don't think we want to take DAX radix tree entry locks > > in the I/O path because that would effectively serialize I/O over a given > > radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range > > would be serialized. > > Note that invalidate_inode_pages2_range() will see the entry created by > grab_mapping_entry() on CPU2 and block waiting for its lock and this is > exactly what stops the race. The invalidate_inode_pages2_range() > effectively makes sure there isn't any page fault in progress for given > range... Yep, this is the bit that I was missing. Thanks. 
> Also note that writes to a file are serialized by i_rwsem anyway (and at > least serialization of writes to the overlapping range is required by POSIX) > so this doesn't add any more serialization than we already have. > > > > Another solution would be to grab i_mmap_sem for write when doing write > > > fault of a page and similarly have it grabbed for writing when doing > > > write(2). This would scale rather poorly but if we later replaced it with a > > > range lock (Davidlohr has already posted a nice implementation of it) it > > > won't be as bad. But I guess option 1) is better... > > > > The best idea I had for handling this sounds similar, which would be to > > convert the radix tree locks to essentially be reader/writer locks. I/O and > > faults that don't modify the block mapping could just take read-level locks, > > and could all run concurrently. I/O or faults that modify a block mapping > > would take a write lock, and serialize with other writers and readers. > > Well, this would be difficult to implement inside the radix tree (not > enough bits in the entry) so you'd have to go for some external locking > primitive anyway. And if you do that, read-write range lock Davidlohr has > implemented is what you describe - well we could also have a radix tree > with rwsems but I suspect the overhead of maintaining that would be too > large. It would require larger rewrite than reusing entry locks as I > suggest above though and it isn't an obvious performance win for realistic > workloads either so I'd like to see some performance numbers before going > that way. It likely improves a situation where processes race to fault the > same page for which we already know the block mapping but I'm not sure if > that translates to any measurable performance wins for workloads on DAX > filesystem. 
> > > You could know if you needed a write lock without asking the filesystem - if > > you're a write and the radix tree entry is empty or is for a zero page, you > > grab the write lock. > > > > This dovetails nicely with the idea of having the radix tree act as a cache > > for block mappings. You take the appropriate lock on the radix tree entry, > > and it has the block mapping info for your I/O or fault so you don't have to > > call into the FS. I/O would also participate so we would keep info about > > block mappings that we gather from I/O to help shortcut our page faults. > > > > How does this sound vs the range lock idea? How hard do you think it would be > > to convert our current wait queue system to reader/writer style locking? > > > > Also, how do you think we should deal with the current PMD corruption? Should > > we go with the current fix (I can augment the comments as you suggested), and > > then handle optimizations to that approach and the solution to this larger > > race as a follow-on? > > So for now I'm still more inclined to just stay with the radix tree lock as > is and just fix up the locking as I suggest and go for larger rewrite only > if we can demonstrate further performance wins. Sounds good. > WRT your second patch, if we go with the locking as I suggest, it is enough > to unmap the whole range after invalidate_inode_pages2() has cleared radix > tree entries (*) which will be much cheaper (for large writes) than doing > unmapping entry by entry. I'm still not convinced that it is safe to do the unmap in a separate step. I see your point about it being expensive to do a rmap walk to unmap each entry in __dax_invalidate_mapping_entry(), but I think we might need to because the unmap is part of the contract imposed by invalidate_inode_pages2_range() and invalidate_inode_pages2(). This exists in the header comment above each: * Any pages which are found to be mapped into pagetables are unmapped prior * to invalidation. 
If you look at the usage of invalidate_inode_pages2_range() in generic_file_direct_write() for example (which I realize we won't call for a DAX inode, but still), I think that it really does rely on the fact that invalidated pages are unmapped, right? If it didn't, and hole pages were mapped, the hole pages could remain mapped while a direct I/O write allocated blocks and then wrote real data. If we really want to unmap the entire range at once, maybe it would have to be done in invalidate_inode_pages2_range(), after the loop? My hesitation about this is that we'd be leaking yet more DAX special casing up into the mm/truncate.c code. Or am I missing something? > So I'd go for that. I'll prepare a patch for the > locking change - it will require changes to ext4 transaction handling so it > won't be completely trivial. > > (*) The flow of information is: filesystem block mapping info -> radix tree > -> page tables so if 'filesystem block mapping info' changes, we should go > invalidate corresponding radix tree entries (new entries will already have > uptodate info) and then invalidate corresponding page tables (again once > radix tree has no stale entries, we are sure new page table entries will be > uptodate). > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-05-01 22:38 ` Ross Zwisler
@ 2017-05-04 9:12 ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2017-05-04 9:12 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Mon 01-05-17 16:38:55, Ross Zwisler wrote:
> > So for now I'm still more inclined to just stay with the radix tree lock as
> > is and just fix up the locking as I suggest and go for larger rewrite only
> > if we can demonstrate further performance wins.
>
> Sounds good.
>
> > WRT your second patch, if we go with the locking as I suggest, it is enough
> > to unmap the whole range after invalidate_inode_pages2() has cleared radix
> > tree entries (*) which will be much cheaper (for large writes) than doing
> > unmapping entry by entry.
>
> I'm still not convinced that it is safe to do the unmap in a separate step. I
> see your point about it being expensive to do a rmap walk to unmap each entry
> in __dax_invalidate_mapping_entry(), but I think we might need to because the
> unmap is part of the contract imposed by invalidate_inode_pages2_range() and
> invalidate_inode_pages2(). This exists in the header comment above each:
>
>  * Any pages which are found to be mapped into pagetables are unmapped prior
>  * to invalidation.
>
> If you look at the usage of invalidate_inode_pages2_range() in
> generic_file_direct_write() for example (which I realize we won't call for a
> DAX inode, but still), I think that it really does rely on the fact that
> invalidated pages are unmapped, right?
> If it didn't, and hole pages were
> mapped, the hole pages could remain mapped while a direct I/O write allocated
> blocks and then wrote real data.
>
> If we really want to unmap the entire range at once, maybe it would have to be
> done in invalidate_inode_pages2_range(), after the loop? My hesitation about
> this is that we'd be leaking yet more DAX special casing up into the
> mm/truncate.c code.
>
> Or am I missing something?

No, my thinking was to put the invalidation at the end of
invalidate_inode_pages2_range(). I agree it means more special-casing for
DAX in mm/truncate.c.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads 2017-04-27 7:26 ` Jan Kara 2017-05-01 22:38 ` Ross Zwisler @ 2017-05-01 22:59 ` Dan Williams 1 sibling, 0 replies; 17+ messages in thread From: Dan Williams @ 2017-05-01 22:59 UTC (permalink / raw) To: Jan Kara Cc: Latchesar Ionkov, Trond Myklebust, Linux MM, Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin, Eric Van Hensbergen, linux-nvdimm@lists.01.org, Alexander Viro, v9fs-developer, Jens Axboe, linux-nfs, Darrick J. Wong, samba-technical, linux-kernel@vger.kernel.org, Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel, Ron Minnich, Andrew Morton, Anna Schumaker On Thu, Apr 27, 2017 at 12:26 AM, Jan Kara <jack@suse.cz> wrote: > On Wed 26-04-17 16:52:36, Ross Zwisler wrote: >> On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote: >> > On Tue 25-04-17 16:59:36, Ross Zwisler wrote: >> > > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote: >> > > <> >> > > > Hum, but now thinking more about it I have hard time figuring out why write >> > > > vs fault cannot actually still race: >> > > > >> > > > CPU1 - write(2) CPU2 - read fault >> > > > >> > > > dax_iomap_pte_fault() >> > > > ->iomap_begin() - sees hole >> > > > dax_iomap_rw() >> > > > iomap_apply() >> > > > ->iomap_begin - allocates blocks >> > > > dax_iomap_actor() >> > > > invalidate_inode_pages2_range() >> > > > - there's nothing to invalidate >> > > > grab_mapping_entry() >> > > > - we add zero page in the radix >> > > > tree & map it to page tables >> > > > >> > > > Similarly read vs write fault may end up racing in a wrong way and try to >> > > > replace already existing exceptional entry with a hole page? >> > > >> > > Yep, this race seems real to me, too. This seems very much like the issues >> > > that exist when a thread is doing direct I/O. 
One thread is doing I/O to an >> > > intermediate buffer (page cache for direct I/O case, zero page for us), and >> > > the other is going around it directly to media, and they can get out of sync. >> > > >> > > IIRC the direct I/O code looked something like: >> > > >> > > 1/ invalidate existing mappings >> > > 2/ do direct I/O to media >> > > 3/ invalidate mappings again, just in case. Should be cheap if there weren't >> > > any conflicting faults. This makes sure any new allocations we made are >> > > faulted in. >> > >> > Yeah, the problem is people generally expect weird behavior when they mix >> > direct and buffered IO (or let alone mmap) however everyone expects >> > standard read(2) and write(2) to be completely coherent with mmap(2). >> >> Yep, fair enough. >> >> > > I guess one option would be to replicate that logic in the DAX I/O path, or we >> > > could try and enhance our locking so page faults can't race with I/O since >> > > both can allocate blocks. >> > >> > In the abstract way, the problem is that we have radix tree (and page >> > tables) cache block mapping information and the operation: "read block >> > mapping information, store it in the radix tree" is not serialized in any >> > way against other block allocations so the information we store can be out >> > of date by the time we store it. >> > >> > One way to solve this would be to move ->iomap_begin call in the fault >> > paths under entry lock although that would mean I have to redo how ext4 >> > handles DAX faults because with current code it would create lock inversion >> > wrt transaction start. >> >> I don't think this alone is enough to save us. 
The I/O path doesn't currently >> take any DAX radix tree entry locks, so our race would just become: >> >> CPU1 - write(2) CPU2 - read fault >> >> dax_iomap_pte_fault() >> grab_mapping_entry() // newly moved >> ->iomap_begin() - sees hole >> dax_iomap_rw() >> iomap_apply() >> ->iomap_begin - allocates blocks >> dax_iomap_actor() >> invalidate_inode_pages2_range() >> - there's nothing to invalidate >> - we add zero page in the radix >> tree & map it to page tables >> >> In their current form I don't think we want to take DAX radix tree entry locks >> in the I/O path because that would effectively serialize I/O over a given >> radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range >> would be serialized. > > Note that invalidate_inode_pages2_range() will see the entry created by > grab_mapping_entry() on CPU2 and block waiting for its lock and this is > exactly what stops the race. The invalidate_inode_pages2_range() > effectively makes sure there isn't any page fault in progress for given > range... > > Also note that writes to a file are serialized by i_rwsem anyway (and at > least serialization of writes to the overlapping range is required by POSIX) > so this doesn't add any more serialization than we already have. > >> > Another solution would be to grab i_mmap_sem for write when doing write >> > fault of a page and similarly have it grabbed for writing when doing >> > write(2). This would scale rather poorly but if we later replaced it with a >> > range lock (Davidlohr has already posted a nice implementation of it) it >> > won't be as bad. But I guess option 1) is better... >> >> The best idea I had for handling this sounds similar, which would be to >> convert the radix tree locks to essentially be reader/writer locks. I/O and >> faults that don't modify the block mapping could just take read-level locks, >> and could all run concurrently. 
I/O or faults that modify a block mapping >> would take a write lock, and serialize with other writers and readers. > > Well, this would be difficult to implement inside the radix tree (not > enough bits in the entry) so you'd have to go for some external locking > primitive anyway. And if you do that, read-write range lock Davidlohr has > implemented is what you describe - well we could also have a radix tree > with rwsems but I suspect the overhead of maintaining that would be too > large. It would require larger rewrite than reusing entry locks as I > suggest above though and it isn't an obvious performance win for realistic > workloads either so I'd like to see some performance numbers before going > that way. It likely improves a situation where processes race to fault the > same page for which we already know the block mapping but I'm not sure if > that translates to any measurable performance wins for workloads on DAX > filesystem. I'm also concerned about inventing new / fancy radix infrastructure when we're already in the space of needing struct page for any non-trivial usage of dax. As Kirill's transparent-huge-page page cache implementation matures I'd be interested in looking at a transition path away from radix locking towards something that it shared with the common case page cache locking. _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] xfs: fix incorrect argument count check
  2017-04-21 3:44 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
  2017-04-21 3:44 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
@ 2017-04-24 17:49 ` Ross Zwisler
  2017-04-24 17:49 ` [PATCH 2/2] dax: add regression test for stale mmap reads Ross Zwisler
  2017-04-25 10:10 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
  2 siblings, 1 reply; 17+ messages in thread
From: Ross Zwisler @ 2017-04-24 17:49 UTC (permalink / raw)
  To: fstests, Xiong Zhou, jmoyer, eguan
  Cc: Jan Kara, Andrew Morton, Darrick J. Wong, linux-nvdimm,
	Christoph Hellwig, linux-mm, linux-fsdevel

t_mmap_dio.c actually requires 4 arguments, not 3 as the current check
enforces:

	usage: t_mmap_dio <src file> <dest file> <size> <msg>
	open src(No such file or directory) len 0 (null)

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: 456581661b4d ("xfs: test per-inode DAX flag by IO")
---
 src/t_mmap_dio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/t_mmap_dio.c b/src/t_mmap_dio.c
index 69b9ca8..6c8ca1a 100644
--- a/src/t_mmap_dio.c
+++ b/src/t_mmap_dio.c
@@ -39,7 +39,7 @@ int main(int argc, char **argv)
 	char *dfile;
 	unsigned long len, opt;
 
-	if (argc < 4)
+	if (argc < 5)
 		usage(basename(argv[0]));
 
 	while ((opt = getopt(argc, argv, "b")) != -1)
--
2.9.3

^ permalink raw reply related [flat|nested] 17+ messages in thread
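As an aside on the off-by-one fixed in the patch above: argc counts argv[0], so a program with four mandatory arguments needs argc >= 5. The following is only a minimal sketch of that rule; has_enough_args() is a hypothetical helper and does not exist in t_mmap_dio.c.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only.
 * argv[0] holds the program name, so N mandatory positional
 * arguments require argc >= N + 1.
 */
bool has_enough_args(int argc, int required_positional)
{
	return argc >= required_positional + 1;
}
```

With four required arguments, an invocation that supplies only three (argc == 4) is rejected by this check, which is the case the old "argc < 4" test let through.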
* [PATCH 2/2] dax: add regression test for stale mmap reads 2017-04-24 17:49 ` [PATCH 1/2] xfs: fix incorrect argument count check Ross Zwisler @ 2017-04-24 17:49 ` Ross Zwisler 2017-04-25 11:27 ` Eryu Guan 0 siblings, 1 reply; 17+ messages in thread From: Ross Zwisler @ 2017-04-24 17:49 UTC (permalink / raw) To: fstests, Xiong Zhou, jmoyer, eguan Cc: Jan Kara, Andrew Morton, Darrick J. Wong, linux-nvdimm, Christoph Hellwig, linux-mm, linux-fsdevel This adds a regression test for the following kernel patch: dax: fix data corruption due to stale mmap reads The above patch fixes an issue where users of DAX can suffer data corruption from stale mmap reads via the following sequence: - open an mmap over a 2MiB hole - read from a 2MiB hole, faulting in a 2MiB zero page - write to the hole with write(3p). The write succeeds but we incorrectly leave the 2MiB zero page mapping intact. - via the mmap, read the data that was just written. Since the zero page mapping is still intact we read back zeroes instead of the new data. 
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- .gitignore | 1 + src/Makefile | 2 +- src/t_dax_stale_pmd.c | 56 ++++++++++++++++++++++++++++++++++++++++++ tests/generic/427 | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++ tests/generic/427.out | 2 ++ tests/generic/group | 1 + 6 files changed, 129 insertions(+), 1 deletion(-) create mode 100644 src/t_dax_stale_pmd.c create mode 100755 tests/generic/427 create mode 100644 tests/generic/427.out diff --git a/.gitignore b/.gitignore index ded4a61..9664dc9 100644 --- a/.gitignore +++ b/.gitignore @@ -134,6 +134,7 @@ /src/renameat2 /src/t_rename_overwrite /src/t_mmap_dio +/src/t_dax_stale_pmd # dmapi/ binaries /dmapi/src/common/cmd/read_invis diff --git a/src/Makefile b/src/Makefile index abfd873..7e22b50 100644 --- a/src/Makefile +++ b/src/Makefile @@ -12,7 +12,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \ godown resvtest writemod makeextents itrash rename \ multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \ t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \ - holetest t_truncate_self t_mmap_dio af_unix + holetest t_truncate_self t_mmap_dio af_unix t_dax_stale_pmd LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \ preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \ diff --git a/src/t_dax_stale_pmd.c b/src/t_dax_stale_pmd.c new file mode 100644 index 0000000..d0016eb --- /dev/null +++ b/src/t_dax_stale_pmd.c @@ -0,0 +1,56 @@ +#include <errno.h> +#include <fcntl.h> +#include <libgen.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <unistd.h> + +#define MiB(a) ((a)*1024*1024) + +void err_exit(char *op) +{ + fprintf(stderr, "%s: %s\n", op, strerror(errno)); + exit(1); +} + +int main(int argc, char *argv[]) +{ + volatile int a __attribute__((__unused__)); + char *buffer = "HELLO WORLD!"; + char *data; + int fd; + + if 
(argc < 2) { + printf("Usage: %s <pmem file>\n", basename(argv[0])); + exit(0); + } + + fd = open(argv[1], O_RDWR); + if (fd < 0) + err_exit("fd"); + + data = mmap(NULL, MiB(2), PROT_READ, MAP_SHARED, fd, MiB(2)); + + /* + * This faults in a 2MiB zero page to satisfy the read. + * 'a' is volatile so this read doesn't get optimized out. + */ + a = data[0]; + + pwrite(fd, buffer, strlen(buffer), MiB(2)); + + /* + * Try and use the mmap to read back the data we just wrote with + * pwrite(). If the kernel bug is present the mapping from the 2MiB + * zero page will still be intact, and we'll read back zeros instead. + */ + if (strncmp(buffer, data, strlen(buffer))) + err_exit("strncmp mismatch!"); + + close(fd); + return 0; +} diff --git a/tests/generic/427 b/tests/generic/427 new file mode 100755 index 0000000..baf1099 --- /dev/null +++ b/tests/generic/427 @@ -0,0 +1,68 @@ +#! /bin/bash +# FS QA Test 427 +# +# This is a regression test for kernel patch: +# dax: fix data corruption due to stale mmap reads +# created by Ross Zwisler <ross.zwisler@linux.intel.com> +# +#----------------------------------------------------------------------- +# Copyright (c) 2017 Intel Corporation. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#----------------------------------------------------------------------- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# remove previous $seqres.full before test +rm -f $seqres.full + +# Modify as appropriate. +_supported_fs generic +_supported_os Linux +_require_scratch_dax +_require_test_program "t_dax_stale_pmd" +_require_user + +# real QA test starts here +_scratch_mkfs >>$seqres.full 2>&1 +_scratch_mount "-o dax" + +$XFS_IO_PROG -f -c "falloc 0 4M" $SCRATCH_MNT/testfile >> $seqres.full 2>&1 +chmod 0644 $SCRATCH_MNT/testfile +chown $qa_user $SCRATCH_MNT/testfile + +_user_do "src/t_dax_stale_pmd $SCRATCH_MNT/testfile" + +# success, all done +echo "Silence is golden" +status=0 +exit diff --git a/tests/generic/427.out b/tests/generic/427.out new file mode 100644 index 0000000..61295e5 --- /dev/null +++ b/tests/generic/427.out @@ -0,0 +1,2 @@ +QA output created by 427 +Silence is golden diff --git a/tests/generic/group b/tests/generic/group index f29009c..06f6e9d 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -429,3 +429,4 @@ 424 auto quick 425 auto quick attr 426 auto quick exportfs +427 auto quick -- 2.9.3 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] dax: add regression test for stale mmap reads 2017-04-24 17:49 ` [PATCH 2/2] dax: add regression test for stale mmap reads Ross Zwisler @ 2017-04-25 11:27 ` Eryu Guan 2017-04-25 20:39 ` Ross Zwisler 0 siblings, 1 reply; 17+ messages in thread From: Eryu Guan @ 2017-04-25 11:27 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, Andrew Morton, Darrick J. Wong, fstests, linux-mm, linux-fsdevel, Christoph Hellwig, linux-nvdimm On Mon, Apr 24, 2017 at 11:49:32AM -0600, Ross Zwisler wrote: > This adds a regression test for the following kernel patch: > > dax: fix data corruption due to stale mmap reads > Seems that this patch hasn't been merged into linus tree, thus 4.11-rc8 kernel should fail this test, but it passed for me, tested with 4.11-rc8 kernel on both ext4 and xfs, with both brd devices and pmem devices created from "memmap=10G!5G memmap=15G!15G" kernel boot command line. Did I miss anything? # ./check -s ext4_pmem_4k generic/427 SECTION -- ext4_pmem_4k RECREATING -- ext4 on /dev/pmem0 FSTYP -- ext4 PLATFORM -- Linux/x86_64 hp-dl360g9-15 4.11.0-rc8.kasan MKFS_OPTIONS -- -b 4096 /dev/pmem1 MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:root_t:s0 /dev/pmem1 /scratch generic/427 1s ... 1s Ran: generic/427 Passed all 1 tests Some comments inline. > The above patch fixes an issue where users of DAX can suffer data > corruption from stale mmap reads via the following sequence: > > - open an mmap over a 2MiB hole > > - read from a 2MiB hole, faulting in a 2MiB zero page > > - write to the hole with write(3p). The write succeeds but we incorrectly > leave the 2MiB zero page mapping intact. > > - via the mmap, read the data that was just written. Since the zero page > mapping is still intact we read back zeroes instead of the new data. 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > .gitignore | 1 + > src/Makefile | 2 +- > src/t_dax_stale_pmd.c | 56 ++++++++++++++++++++++++++++++++++++++++++ > tests/generic/427 | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++ > tests/generic/427.out | 2 ++ > tests/generic/group | 1 + > 6 files changed, 129 insertions(+), 1 deletion(-) > create mode 100644 src/t_dax_stale_pmd.c > create mode 100755 tests/generic/427 > create mode 100644 tests/generic/427.out > > diff --git a/.gitignore b/.gitignore > index ded4a61..9664dc9 100644 > --- a/.gitignore > +++ b/.gitignore > @@ -134,6 +134,7 @@ > /src/renameat2 > /src/t_rename_overwrite > /src/t_mmap_dio > +/src/t_dax_stale_pmd > > # dmapi/ binaries > /dmapi/src/common/cmd/read_invis > diff --git a/src/Makefile b/src/Makefile > index abfd873..7e22b50 100644 > --- a/src/Makefile > +++ b/src/Makefile > @@ -12,7 +12,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \ > godown resvtest writemod makeextents itrash rename \ > multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \ > t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \ > - holetest t_truncate_self t_mmap_dio af_unix > + holetest t_truncate_self t_mmap_dio af_unix t_dax_stale_pmd > > LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \ > preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \ > diff --git a/src/t_dax_stale_pmd.c b/src/t_dax_stale_pmd.c > new file mode 100644 > index 0000000..d0016eb > --- /dev/null > +++ b/src/t_dax_stale_pmd.c > @@ -0,0 +1,56 @@ > +#include <errno.h> > +#include <fcntl.h> > +#include <libgen.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <sys/mman.h> > +#include <sys/stat.h> > +#include <sys/types.h> > +#include <unistd.h> > + > +#define MiB(a) ((a)*1024*1024) > + > +void err_exit(char *op) > +{ > + fprintf(stderr, "%s: %s\n", op, strerror(errno)); > + exit(1); > +} > + > +int main(int argc, 
char *argv[])
> +{
> +	volatile int a __attribute__((__unused__));
> +	char *buffer = "HELLO WORLD!";
> +	char *data;
> +	int fd;
> +
> +	if (argc < 2) {
> +		printf("Usage: %s <pmem file>\n", basename(argv[0]));
> +		exit(0);
> +	}
> +
> +	fd = open(argv[1], O_RDWR);
> +	if (fd < 0)
> +		err_exit("fd");
                         ^^^^ Nitpick, the "op" should be "open"?
> +
> +	data = mmap(NULL, MiB(2), PROT_READ, MAP_SHARED, fd, MiB(2));
> +
> +	/*
> +	 * This faults in a 2MiB zero page to satisfy the read.
> +	 * 'a' is volatile so this read doesn't get optimized out.
> +	 */
> +	a = data[0];
> +
> +	pwrite(fd, buffer, strlen(buffer), MiB(2));
> +
> +	/*
> +	 * Try and use the mmap to read back the data we just wrote with
> +	 * pwrite().  If the kernel bug is present the mapping from the 2MiB
> +	 * zero page will still be intact, and we'll read back zeros instead.
> +	 */
> +	if (strncmp(buffer, data, strlen(buffer)))
> +		err_exit("strncmp mismatch!");

strncmp doesn't set errno, this err_exit message might be confusing:
"strncmp mismatch!: Success"

> +
> +	close(fd);
> +	return 0;
> +}
> diff --git a/tests/generic/427 b/tests/generic/427
> new file mode 100755
> index 0000000..baf1099
> --- /dev/null
> +++ b/tests/generic/427
> @@ -0,0 +1,68 @@
> +#! /bin/bash
> +# FS QA Test 427
> +#
> +# This is a regression test for kernel patch:
> +#  dax: fix data corruption due to stale mmap reads
> +# created by Ross Zwisler <ross.zwisler@linux.intel.com>
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) 2017 Intel Corporation.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch_dax

I don't think dax is a requirement here, this test could run on normal
block device without "-o dax" option too. It won't hurt to run with more
test configurations. And test on nvdimm device with dax mount option
could be one of the test configs, e.g.

TEST_DEV=/dev/pmem0
SCRATCH_DEV=/dev/pmem1
MOUNT_OPTIONS="-o dax"
...

> +_require_test_program "t_dax_stale_pmd"
> +_require_user

_require_xfs_io_command "falloc"

So test _notrun on ext2/3.

> +
> +# real QA test starts here
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount "-o dax"

Same here, dax is not required.

> +
> +$XFS_IO_PROG -f -c "falloc 0 4M" $SCRATCH_MNT/testfile >> $seqres.full 2>&1
> +chmod 0644 $SCRATCH_MNT/testfile
> +chown $qa_user $SCRATCH_MNT/testfile

Any specific reason to use $qa_user to run this test? Comments would be
great.

Thanks,
Eryu

> +
> +_user_do "src/t_dax_stale_pmd $SCRATCH_MNT/testfile"
> +
> +# success, all done
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/generic/427.out b/tests/generic/427.out
> new file mode 100644
> index 0000000..61295e5
> --- /dev/null
> +++ b/tests/generic/427.out
> @@ -0,0 +1,2 @@
> +QA output created by 427
> +Silence is golden
> diff --git a/tests/generic/group b/tests/generic/group
> index f29009c..06f6e9d 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -429,3 +429,4 @@
>  424 auto quick
>  425 auto quick attr
>  426 auto quick exportfs
> +427 auto quick
> -- 
> 2.9.3
> 
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

* Re: [PATCH 2/2] dax: add regression test for stale mmap reads
  2017-04-25 11:27   ` Eryu Guan
@ 2017-04-25 20:39     ` Ross Zwisler
  2017-04-26  3:42       ` Eryu Guan
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Zwisler @ 2017-04-25 20:39 UTC (permalink / raw)
  To: Eryu Guan
  Cc: Jan Kara, Andrew Morton, Darrick J. Wong, fstests,
	Christoph Hellwig, linux-mm, linux-fsdevel, linux-nvdimm

On Tue, Apr 25, 2017 at 07:27:39PM +0800, Eryu Guan wrote:
> On Mon, Apr 24, 2017 at 11:49:32AM -0600, Ross Zwisler wrote:
> > This adds a regression test for the following kernel patch:
> > 
> >   dax: fix data corruption due to stale mmap reads
> > 
> 
> Seems that this patch hasn't been merged into linus tree, thus 4.11-rc8
> kernel should fail this test, but it passed for me, tested with 4.11-rc8
> kernel on both ext4 and xfs, with both brd devices and pmem devices
> created from "memmap=10G!5G memmap=15G!15G" kernel boot command line.
> Did I miss anything?
> 
> # ./check -s ext4_pmem_4k generic/427
> SECTION       -- ext4_pmem_4k

Ooh, I didn't add this 'ext4_pmem_4k' section goodness, and it's not present
in the xfstests/master that I was using.  Do you have patches to add that?

> RECREATING    -- ext4 on /dev/pmem0
> FSTYP         -- ext4
> PLATFORM      -- Linux/x86_64 hp-dl360g9-15 4.11.0-rc8.kasan
> MKFS_OPTIONS  -- -b 4096 /dev/pmem1
> MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:root_t:s0 /dev/pmem1 /scratch
> 
> generic/427 1s ... 1s
> Ran: generic/427
> Passed all 1 tests

Your memmap params look fine.  I tested with BRD and PMEM, and with EXT4 and
XFS, and all combinations failed for me as expected with v4.11-rc8.

One issue could have been that the test file already existed when the test was
run.  I wasn't removing it between runs earlier, but I've fixed that for v2.

Another issue I guess could have been that the hole that we got back from the
filesystem was smaller than 2MiB?  Can you try running v2 (which I'll post in
a second) against a TEST_DEV made with one of the following:

ext4:  mkfs.ext4 -b 4096 -E stride=512 -F $TEST_DEV
xfs:   mkfs.xfs -f -d su=2m,sw=1 $TEST_DEV

This helps us get 2MiB sized and aligned allocations so we can fault in PMDs,
but I'm not sure whether or not it would matter for holes.

> Some comments inline.
> 
> > The above patch fixes an issue where users of DAX can suffer data
> > corruption from stale mmap reads via the following sequence:
> > 
> > - open an mmap over a 2MiB hole
> > 
> > - read from a 2MiB hole, faulting in a 2MiB zero page
> > 
> > - write to the hole with write(3p).  The write succeeds but we incorrectly
> >   leave the 2MiB zero page mapping intact.
> > 
> > - via the mmap, read the data that was just written.  Since the zero page
> >   mapping is still intact we read back zeroes instead of the new data.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  .gitignore            |  1 +
> >  src/Makefile          |  2 +-
> >  src/t_dax_stale_pmd.c | 56 ++++++++++++++++++++++++++++++++++++++++++
> >  tests/generic/427     | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/generic/427.out |  2 ++
> >  tests/generic/group   |  1 +
> >  6 files changed, 129 insertions(+), 1 deletion(-)
> >  create mode 100644 src/t_dax_stale_pmd.c
> >  create mode 100755 tests/generic/427
> >  create mode 100644 tests/generic/427.out
> > 
> > diff --git a/.gitignore b/.gitignore
> > index ded4a61..9664dc9 100644
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -134,6 +134,7 @@
> >  /src/renameat2
> >  /src/t_rename_overwrite
> >  /src/t_mmap_dio
> > +/src/t_dax_stale_pmd
> >  
> >  # dmapi/ binaries
> >  /dmapi/src/common/cmd/read_invis
> > diff --git a/src/Makefile b/src/Makefile
> > index abfd873..7e22b50 100644
> > --- a/src/Makefile
> > +++ b/src/Makefile
> > @@ -12,7 +12,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
> >  	godown resvtest writemod makeextents itrash rename \
> >  	multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
> >  	t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
> > -	holetest t_truncate_self t_mmap_dio af_unix
> > +	holetest t_truncate_self t_mmap_dio af_unix t_dax_stale_pmd
> >  
> >  LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
> >  	preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
> > diff --git a/src/t_dax_stale_pmd.c b/src/t_dax_stale_pmd.c
> > new file mode 100644
> > index 0000000..d0016eb
> > --- /dev/null
> > +++ b/src/t_dax_stale_pmd.c
> > @@ -0,0 +1,56 @@
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <libgen.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/mman.h>
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <unistd.h>
> > +
> > +#define MiB(a) ((a)*1024*1024)
> > +
> > +void err_exit(char *op)
> > +{
> > +	fprintf(stderr, "%s: %s\n", op, strerror(errno));
> > +	exit(1);
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +	volatile int a __attribute__((__unused__));
> > +	char *buffer = "HELLO WORLD!";
> > +	char *data;
> > +	int fd;
> > +
> > +	if (argc < 2) {
> > +		printf("Usage: %s <pmem file>\n", basename(argv[0]));
> > +		exit(0);
> > +	}
> > +
> > +	fd = open(argv[1], O_RDWR);
> > +	if (fd < 0)
> > +		err_exit("fd");
>                          ^^^^ Nitpick, the "op" should be "open"?
> > +
> > +	data = mmap(NULL, MiB(2), PROT_READ, MAP_SHARED, fd, MiB(2));
> > +
> > +	/*
> > +	 * This faults in a 2MiB zero page to satisfy the read.
> > +	 * 'a' is volatile so this read doesn't get optimized out.
> > +	 */
> > +	a = data[0];
> > +
> > +	pwrite(fd, buffer, strlen(buffer), MiB(2));
> > +
> > +	/*
> > +	 * Try and use the mmap to read back the data we just wrote with
> > +	 * pwrite().  If the kernel bug is present the mapping from the 2MiB
> > +	 * zero page will still be intact, and we'll read back zeros instead.
> > +	 */
> > +	if (strncmp(buffer, data, strlen(buffer)))
> > +		err_exit("strncmp mismatch!");
> 
> strncmp doesn't set errno, this err_exit message might be confusing:
> "strncmp mismatch!: Success"

Ah, thanks, fixed in v2.

> > +
> > +	close(fd);
> > +	return 0;
> > +}
> > diff --git a/tests/generic/427 b/tests/generic/427
> > new file mode 100755
> > index 0000000..baf1099
> > --- /dev/null
> > +++ b/tests/generic/427
> > @@ -0,0 +1,68 @@
> > +#! /bin/bash
> > +# FS QA Test 427
> > +#
> > +# This is a regression test for kernel patch:
> > +#  dax: fix data corruption due to stale mmap reads
> > +# created by Ross Zwisler <ross.zwisler@linux.intel.com>
> > +#
> > +#-----------------------------------------------------------------------
> > +# Copyright (c) 2017 Intel Corporation.  All Rights Reserved.
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#-----------------------------------------------------------------------
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1	# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +	cd /
> > +	rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# remove previous $seqres.full before test
> > +rm -f $seqres.full
> > +
> > +# Modify as appropriate.
> > +_supported_fs generic
> > +_supported_os Linux
> > +_require_scratch_dax
> 
> I don't think dax is a requirement here, this test could run on normal
> block device without "-o dax" option too. It won't hurt to run with more
> test configurations. And test on nvdimm device with dax mount option
> could be one of the test configs, e.g.
> 
> TEST_DEV=/dev/pmem0
> SCRATCH_DEV=/dev/pmem1
> MOUNT_OPTIONS="-o dax"
> ...

Yep, agreed, fixed in v2.

> > +_require_test_program "t_dax_stale_pmd"
> > +_require_user
> 
> _require_xfs_io_command "falloc"
> 
> So test _notrun on ext2/3.

Fixed in v2.

> > +
> > +# real QA test starts here
> > +_scratch_mkfs >>$seqres.full 2>&1
> > +_scratch_mount "-o dax"
> 
> Same here, dax is not required.

Fixed in v2.

> 
> > +
> > +$XFS_IO_PROG -f -c "falloc 0 4M" $SCRATCH_MNT/testfile >> $seqres.full 2>&1
> > +chmod 0644 $SCRATCH_MNT/testfile
> > +chown $qa_user $SCRATCH_MNT/testfile
> 
> Any specific reason to use $qa_user to run this test? Comments would be
> great.

Nope, just cargo-culting my way through my first xfstest. :)  I've removed
this for v2.

> Thanks,
> Eryu

Thanks for the review!

> > +
> > +_user_do "src/t_dax_stale_pmd $SCRATCH_MNT/testfile"
> > +
> > +# success, all done
> > +echo "Silence is golden"
> > +status=0
> > +exit
> > diff --git a/tests/generic/427.out b/tests/generic/427.out

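For reference, the two review comments on the test program (report "open" rather than "fd" on failure, and do not route the strncmp() mismatch through err_exit() because strncmp() does not set errno) could be folded in roughly as follows. This is only a sketch and not the v2 that was posted: the helper name check_stale_mmap_read(), its return-code convention, and the extra mmap()/pwrite() return checks are assumptions made here for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define MiB(a) ((a) * 1024 * 1024)

/*
 * Hypothetical helper, not part of the posted patch.  Returns 0 if data
 * written with pwrite() is visible through an already established
 * read-only mmap, 1 if the mapping still returns stale contents, and -1
 * on a syscall error.  The file at 'path' must be at least 4MiB long.
 */
int check_stale_mmap_read(const char *path)
{
	const char *buffer = "HELLO WORLD!";
	size_t len = strlen(buffer);
	volatile char a;
	char *data;
	int fd, ret = 0;

	fd = open(path, O_RDWR);
	if (fd < 0) {
		perror("open");		/* name the call that failed */
		return -1;
	}

	data = mmap(NULL, MiB(2), PROT_READ, MAP_SHARED, fd, MiB(2));
	if (data == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return -1;
	}

	/* Fault in the page; 'a' is volatile so the read is not elided. */
	a = data[0];
	(void)a;

	/* A short write is treated as an error in this sketch. */
	if (pwrite(fd, buffer, len, MiB(2)) != (ssize_t)len) {
		perror("pwrite");
		ret = -1;
	} else if (strncmp(buffer, data, len)) {
		/* strncmp() does not set errno, so no strerror() here. */
		fprintf(stderr, "strncmp mismatch: stale data read via mmap\n");
		ret = 1;
	}

	munmap(data, MiB(2));
	close(fd);
	return ret;
}
```

A main() wrapper would pass argv[1] to this helper and exit non-zero on any non-zero return, which keeps the xfstests "Silence is golden" output convention intact on success.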
* Re: [PATCH 2/2] dax: add regression test for stale mmap reads
  2017-04-25 20:39     ` Ross Zwisler
@ 2017-04-26  3:42       ` Eryu Guan
  0 siblings, 0 replies; 17+ messages in thread
From: Eryu Guan @ 2017-04-26 3:42 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Andrew Morton, Darrick J. Wong, fstests, linux-mm,
	linux-fsdevel, Christoph Hellwig, linux-nvdimm

On Tue, Apr 25, 2017 at 02:39:11PM -0600, Ross Zwisler wrote:
> On Tue, Apr 25, 2017 at 07:27:39PM +0800, Eryu Guan wrote:
> > On Mon, Apr 24, 2017 at 11:49:32AM -0600, Ross Zwisler wrote:
> > > This adds a regression test for the following kernel patch:
> > > 
> > >   dax: fix data corruption due to stale mmap reads
> > > 
> > 
> > Seems that this patch hasn't been merged into linus tree, thus 4.11-rc8
> > kernel should fail this test, but it passed for me, tested with 4.11-rc8
> > kernel on both ext4 and xfs, with both brd devices and pmem devices
> > created from "memmap=10G!5G memmap=15G!15G" kernel boot command line.
> > Did I miss anything?
> > 
> > # ./check -s ext4_pmem_4k generic/427
> > SECTION       -- ext4_pmem_4k
> 
> Ooh, I didn't add this 'ext4_pmem_4k' section goodness, and it's not present
> in the xfstests/master that I was using.  Do you have patches to add that?

That's one of my config sections, it's all user-defined, not committed
to fstests repo :) You can take a look at README.config-sections for
more details.

Here is my local.config file for your reference

[default]
TEST_DEV=/dev/pmem0
SCRATCH_DEV=/dev/pmem1
TEST_DIR=/mnt
SCRATCH_MNT=/scratch
RECREATE_TEST_DEV=true

[xfs_pmem_4k]
FSTYP=xfs
MKFS_OPTIONS="-f -m crc=1 -b size=4k"

[ext4_pmem_4k]
FSTYP=ext4
MKFS_OPTIONS="-b 4096"

> 
> > RECREATING    -- ext4 on /dev/pmem0
> > FSTYP         -- ext4
> > PLATFORM      -- Linux/x86_64 hp-dl360g9-15 4.11.0-rc8.kasan
> > MKFS_OPTIONS  -- -b 4096 /dev/pmem1
> > MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:root_t:s0 /dev/pmem1 /scratch
> > 
> > generic/427 1s ... 1s
> > Ran: generic/427
> > Passed all 1 tests
> 
> Your memmap params look fine.  I tested with BRD and PMEM, and with EXT4 and
> XFS, and all combinations failed for me as expected with v4.11-rc8.
> 
> One issue could have been that the test file already existed when the test was
> run.  I wasn't removing it between runs earlier, but I've fixed that for v2.
> 
> Another issue I guess could have been that the hole that we got back from the
> filesystem was smaller than 2MiB?  Can you try running v2 (which I'll post in
> a second) against a TEST_DEV made with one of the following:
> 
> ext4:  mkfs.ext4 -b 4096 -E stride=512 -F $TEST_DEV
> xfs:   mkfs.xfs -f -d su=2m,sw=1 $TEST_DEV
> 
> This helps us get 2MiB sized and aligned allocations so we can fault in PMDs,
> but I'm not sure whether or not it would matter for holes.

I guess that's the point to reproduce the failure, I'll confirm with v2
patches. If these non-default & not widely tested mkfs options are
required to reproduce it, I think we can specify them in the test, as
extra mkfs options to _scratch_mkfs, as what generic/413 does.

> 
> > Some comments inline.
> > 
> > > The above patch fixes an issue where users of DAX can suffer data
> > > corruption from stale mmap reads via the following sequence:
> > > 
> > > - open an mmap over a 2MiB hole
> > > 
> > > - read from a 2MiB hole, faulting in a 2MiB zero page
> > > 
> > > - write to the hole with write(3p).  The write succeeds but we incorrectly
> > >   leave the 2MiB zero page mapping intact.
> > > 
> > > - via the mmap, read the data that was just written.  Since the zero page
> > >   mapping is still intact we read back zeroes instead of the new data.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > ---
> > >  .gitignore            |  1 +
> > >  src/Makefile          |  2 +-
> > >  src/t_dax_stale_pmd.c | 56 ++++++++++++++++++++++++++++++++++++++++++
> > >  tests/generic/427     | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  tests/generic/427.out |  2 ++
> > >  tests/generic/group   |  1 +
> > >  6 files changed, 129 insertions(+), 1 deletion(-)
> > >  create mode 100644 src/t_dax_stale_pmd.c
> > >  create mode 100755 tests/generic/427
> > >  create mode 100644 tests/generic/427.out
> > > 
> > > diff --git a/.gitignore b/.gitignore
> > > index ded4a61..9664dc9 100644
> > > --- a/.gitignore
> > > +++ b/.gitignore
> > > @@ -134,6 +134,7 @@
> > >  /src/renameat2
> > >  /src/t_rename_overwrite
> > >  /src/t_mmap_dio
> > > +/src/t_dax_stale_pmd
> > >  
> > >  # dmapi/ binaries
> > >  /dmapi/src/common/cmd/read_invis
> > > diff --git a/src/Makefile b/src/Makefile
> > > index abfd873..7e22b50 100644
> > > --- a/src/Makefile
> > > +++ b/src/Makefile
> > > @@ -12,7 +12,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
> > >  	godown resvtest writemod makeextents itrash rename \
> > >  	multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
> > >  	t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
> > > -	holetest t_truncate_self t_mmap_dio af_unix
> > > +	holetest t_truncate_self t_mmap_dio af_unix t_dax_stale_pmd
> > >  
> > >  LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
> > >  	preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
> > > diff --git a/src/t_dax_stale_pmd.c b/src/t_dax_stale_pmd.c
> > > new file mode 100644
> > > index 0000000..d0016eb
> > > --- /dev/null
> > > +++ b/src/t_dax_stale_pmd.c
> > > @@ -0,0 +1,56 @@
> > > +#include <errno.h>
> > > +#include <fcntl.h>
> > > +#include <libgen.h>
> > > +#include <stdio.h>
> > > +#include <stdlib.h>
> > > +#include <string.h>
> > > +#include <sys/mman.h>
> > > +#include <sys/stat.h>
> > > +#include <sys/types.h>
> > > +#include <unistd.h>
> > > +
> > > +#define MiB(a) ((a)*1024*1024)
> > > +
> > > +void err_exit(char *op)
> > > +{
> > > +	fprintf(stderr, "%s: %s\n", op, strerror(errno));
> > > +	exit(1);
> > > +}
> > > +
> > > +int main(int argc, char *argv[])
> > > +{
> > > +	volatile int a __attribute__((__unused__));
> > > +	char *buffer = "HELLO WORLD!";
> > > +	char *data;
> > > +	int fd;
> > > +
> > > +	if (argc < 2) {
> > > +		printf("Usage: %s <pmem file>\n", basename(argv[0]));
> > > +		exit(0);
> > > +	}
> > > +
> > > +	fd = open(argv[1], O_RDWR);
> > > +	if (fd < 0)
> > > +		err_exit("fd");
> >                          ^^^^ Nitpick, the "op" should be "open"?
> > > +
> > > +	data = mmap(NULL, MiB(2), PROT_READ, MAP_SHARED, fd, MiB(2));
> > > +
> > > +	/*
> > > +	 * This faults in a 2MiB zero page to satisfy the read.
> > > +	 * 'a' is volatile so this read doesn't get optimized out.
> > > +	 */
> > > +	a = data[0];
> > > +
> > > +	pwrite(fd, buffer, strlen(buffer), MiB(2));
> > > +
> > > +	/*
> > > +	 * Try and use the mmap to read back the data we just wrote with
> > > +	 * pwrite().  If the kernel bug is present the mapping from the 2MiB
> > > +	 * zero page will still be intact, and we'll read back zeros instead.
> > > +	 */
> > > +	if (strncmp(buffer, data, strlen(buffer)))
> > > +		err_exit("strncmp mismatch!");
> > 
> > strncmp doesn't set errno, this err_exit message might be confusing:
> > "strncmp mismatch!: Success"
> 
> Ah, thanks, fixed in v2.
> 
> > > +
> > > +	close(fd);
> > > +	return 0;
> > > +}
> > > diff --git a/tests/generic/427 b/tests/generic/427
> > > new file mode 100755
> > > index 0000000..baf1099
> > > --- /dev/null
> > > +++ b/tests/generic/427
> > > @@ -0,0 +1,68 @@
> > > +#! /bin/bash
> > > +# FS QA Test 427
> > > +#
> > > +# This is a regression test for kernel patch:
> > > +#  dax: fix data corruption due to stale mmap reads
> > > +# created by Ross Zwisler <ross.zwisler@linux.intel.com>
> > > +#
> > > +#-----------------------------------------------------------------------
> > > +# Copyright (c) 2017 Intel Corporation.  All Rights Reserved.
> > > +#
> > > +# This program is free software; you can redistribute it and/or
> > > +# modify it under the terms of the GNU General Public License as
> > > +# published by the Free Software Foundation.
> > > +#
> > > +# This program is distributed in the hope that it would be useful,
> > > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > +# GNU General Public License for more details.
> > > +#
> > > +# You should have received a copy of the GNU General Public License
> > > +# along with this program; if not, write the Free Software Foundation,
> > > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > > +#-----------------------------------------------------------------------
> > > +#
> > > +
> > > +seq=`basename $0`
> > > +seqres=$RESULT_DIR/$seq
> > > +echo "QA output created by $seq"
> > > +
> > > +here=`pwd`
> > > +tmp=/tmp/$$
> > > +status=1	# failure is the default!
> > > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > > +
> > > +_cleanup()
> > > +{
> > > +	cd /
> > > +	rm -f $tmp.*
> > > +}
> > > +
> > > +# get standard environment, filters and checks
> > > +. ./common/rc
> > > +. ./common/filter
> > > +
> > > +# remove previous $seqres.full before test
> > > +rm -f $seqres.full
> > > +
> > > +# Modify as appropriate.
> > > +_supported_fs generic
> > > +_supported_os Linux
> > > +_require_scratch_dax
> > 
> > I don't think dax is a requirement here, this test could run on normal
> > block device without "-o dax" option too. It won't hurt to run with more
> > test configurations. And test on nvdimm device with dax mount option
> > could be one of the test configs, e.g.
> > 
> > TEST_DEV=/dev/pmem0
> > SCRATCH_DEV=/dev/pmem1
> > MOUNT_OPTIONS="-o dax"
> > ...
> 
> Yep, agreed, fixed in v2.

Then perhaps the test program should be renamed? As no dax is required.
How about t_mmap_stale_pmd?

> 
> > > +_require_test_program "t_dax_stale_pmd"
> > > +_require_user
> > 
> > _require_xfs_io_command "falloc"
> > 
> > So test _notrun on ext2/3.
> 
> Fixed in v2.
> 
> > > +
> > > +# real QA test starts here
> > > +_scratch_mkfs >>$seqres.full 2>&1
> > > +_scratch_mount "-o dax"
> > 
> > Same here, dax is not required.
> 
> Fixed in v2.
> 
> > 
> > > +
> > > +$XFS_IO_PROG -f -c "falloc 0 4M" $SCRATCH_MNT/testfile >> $seqres.full 2>&1
> > > +chmod 0644 $SCRATCH_MNT/testfile
> > > +chown $qa_user $SCRATCH_MNT/testfile
> > 
> > Any specific reason to use $qa_user to run this test? Comments would be
> > great.
> 
> Nope, just cargo-culting my way through my first xfstest. :)  I've removed
> this for v2.

I think it's in a pretty good shape for "first fstests test" :)

> 
> > Thanks,
> > Eryu
> 
> Thanks for the review!

Thanks for adding new test!

Eryu

> 
> > > +
> > > +_user_do "src/t_dax_stale_pmd $SCRATCH_MNT/testfile"
> > > +
> > > +# success, all done
> > > +echo "Silence is golden"
> > > +status=0
> > > +exit
> > > diff --git a/tests/generic/427.out b/tests/generic/427.out
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

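On the open question above of whether the range at offset 2MiB is really seen as a hole: one way to check from userspace is lseek() with SEEK_DATA. The following is a minimal sketch under stated assumptions, not something taken from the posted patches. range_is_hole() is a hypothetical helper name; filesystems without SEEK_DATA support report the whole file as data, and how preallocated (falloc) but unwritten extents are reported is filesystem-dependent, so the result is only a hint.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3	/* Linux value, used if the headers did not expose it */
#endif

/*
 * Hypothetical helper: returns 1 if no data is reported anywhere in
 * [start, start + len), 0 if some data is reported in that range, and
 * -1 on error.  ENXIO from SEEK_DATA means there is no data at or after
 * 'start' (a hole running to EOF, or 'start' at/after EOF), which is
 * counted as a hole here.  Note that lseek() moves the file offset of
 * 'fd'; pwrite() and mmap() are not affected by that.
 */
int range_is_hole(int fd, off_t start, off_t len)
{
	off_t data = lseek(fd, start, SEEK_DATA);

	if (data == (off_t)-1)
		return errno == ENXIO ? 1 : -1;

	/* First data at or after 'start' lies beyond the range: a hole. */
	return data >= start + len;
}
```

Called as range_is_hole(fd, MiB(2), MiB(2)) right after open() in the test program, this would show whether the filesystem reports the second 2MiB of the file as a hole before the mmap read faults anything in.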
* Re: [PATCH 1/2] dax: prevent invalidation of mapped DAX entries
  2017-04-21  3:44 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
  2017-04-21  3:44   ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
  2017-04-24 17:49   ` [PATCH 1/2] xfs: fix incorrect argument count check Ross Zwisler
@ 2017-04-25 10:10   ` Jan Kara
  2017-05-01 16:54     ` Ross Zwisler
  2 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2017-04-25 10:10 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Latchesar Ionkov, Jan Kara, Trond Myklebust, linux-mm,
	Christoph Hellwig, linux-cifs, Matthew Wilcox, Andrey Ryabinin,
	Eric Van Hensbergen, linux-nvdimm, Alexander Viro,
	v9fs-developer, Jens Axboe, linux-nfs, Darrick J. Wong,
	samba-technical, linux-kernel, Steve French, Alexey Kuznetsov,
	Johannes Weiner, linux-fsdevel, Ron Minnich, Andrew Morton,
	Anna Schumaker

On Thu 20-04-17 21:44:36, Ross Zwisler wrote:
> dax_invalidate_mapping_entry() currently removes DAX exceptional entries
> only if they are clean and unlocked.  This is done via:
> 
> invalidate_mapping_pages()
>   invalidate_exceptional_entry()
>     dax_invalidate_mapping_entry()
> 
> However, for page cache pages removed in invalidate_mapping_pages() there
> is an additional criteria which is that the page must not be mapped.  This
> is noted in the comments above invalidate_mapping_pages() and is checked in
> invalidate_inode_page().
> 
> For DAX entries this means that we can can end up in a situation where a
> DAX exceptional entry, either a huge zero page or a regular DAX entry,
> could end up mapped but without an associated radix tree entry. This is
> inconsistent with the rest of the DAX code and with what happens in the
> page cache case.
> 
> We aren't able to unmap the DAX exceptional entry because according to its
> comments invalidate_mapping_pages() isn't allowed to block, and
> unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
> 
> Since we essentially never have unmapped DAX entries to evict from the
> radix tree, just remove dax_invalidate_mapping_entry().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>    [4.10+]

Just as a side note - we wouldn't really have to unmap the mapping range
covered by the DAX exceptional entry. It would be enough to find out
whether such range is mapped and bail out in that case. But that would
still be pretty expensive for DAX - we'd have to do rmap walk similar as in
dax_mapping_entry_mkclean() and IMHO it is not worth it. So I agree with
what you did. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> 
> This series applies cleanly to the current v4.11-rc7 based linux/master,
> and has passed an xfstests run with DAX on ext4 and XFS.
> 
> These patches also apply to v4.10.9 with a little work from the 3-way
> merge feature.
> 
>  fs/dax.c            | 29 -----------------------------
>  include/linux/dax.h |  1 -
>  mm/truncate.c       |  9 +++------
>  3 files changed, 3 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 85abd74..166504c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -507,35 +507,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  }
>  
>  /*
> - * Invalidate exceptional DAX entry if easily possible. This handles DAX
> - * entries for invalidate_inode_pages() so we evict the entry only if we can
> - * do so without blocking.
> - */
> -int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
> -{
> -	int ret = 0;
> -	void *entry, **slot;
> -	struct radix_tree_root *page_tree = &mapping->page_tree;
> -
> -	spin_lock_irq(&mapping->tree_lock);
> -	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
> -	if (!entry || !radix_tree_exceptional_entry(entry) ||
> -	    slot_locked(mapping, slot))
> -		goto out;
> -	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> -	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> -		goto out;
> -	radix_tree_delete(page_tree, index);
> -	mapping->nrexceptional--;
> -	ret = 1;
> -out:
> -	spin_unlock_irq(&mapping->tree_lock);
> -	if (ret)
> -		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
> -	return ret;
> -}
> -
> -/*
>   * Invalidate exceptional DAX entry if it is clean.
>   */
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index d8a3dc0..f8e1833 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -41,7 +41,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>  		    const struct iomap_ops *ops);
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> -int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 6263aff..c537184 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
>  
>  /*
>   * Invalidate exceptional entry if easily possible. This handles exceptional
> - * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
> - * clean entries.
> + * entries for invalidate_inode_pages().
>   */
>  static int invalidate_exceptional_entry(struct address_space *mapping,
>  					pgoff_t index, void *entry)
>  {
> -	/* Handled by shmem itself */
> -	if (shmem_mapping(mapping))
> +	/* Handled by shmem itself, or for DAX we do nothing. */
> +	if (shmem_mapping(mapping) || dax_mapping(mapping))
>  		return 1;
> -	if (dax_mapping(mapping))
> -		return dax_invalidate_mapping_entry(mapping, index);
>  	clear_shadow_entry(mapping, index, entry);
>  	return 1;
>  }
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 1/2] dax: prevent invalidation of mapped DAX entries
  2017-04-25 10:10   ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
@ 2017-05-01 16:54     ` Ross Zwisler
  0 siblings, 0 replies; 17+ messages in thread
From: Ross Zwisler @ 2017-05-01 16:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Latchesar Ionkov, Trond Myklebust, linux-mm, Christoph Hellwig,
	linux-cifs, Matthew Wilcox, Andrey Ryabinin, Eric Van Hensbergen,
	linux-nvdimm, Alexander Viro, v9fs-developer, Jens Axboe,
	linux-nfs, Darrick J. Wong, samba-technical, linux-kernel,
	Steve French, Alexey Kuznetsov, Johannes Weiner, linux-fsdevel,
	Ron Minnich, Andrew Morton, Anna Schumaker

On Tue, Apr 25, 2017 at 12:10:41PM +0200, Jan Kara wrote:
> On Thu 20-04-17 21:44:36, Ross Zwisler wrote:
> > dax_invalidate_mapping_entry() currently removes DAX exceptional entries
> > only if they are clean and unlocked.  This is done via:
> > 
> > invalidate_mapping_pages()
> >   invalidate_exceptional_entry()
> >     dax_invalidate_mapping_entry()
> > 
> > However, for page cache pages removed in invalidate_mapping_pages() there
> > is an additional criteria which is that the page must not be mapped.  This
> > is noted in the comments above invalidate_mapping_pages() and is checked in
> > invalidate_inode_page().
> > 
> > For DAX entries this means that we can can end up in a situation where a
> > DAX exceptional entry, either a huge zero page or a regular DAX entry,
> > could end up mapped but without an associated radix tree entry. This is
> > inconsistent with the rest of the DAX code and with what happens in the
> > page cache case.
> > 
> > We aren't able to unmap the DAX exceptional entry because according to its
> > comments invalidate_mapping_pages() isn't allowed to block, and
> > unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
> > 
> > Since we essentially never have unmapped DAX entries to evict from the
> > radix tree, just remove dax_invalidate_mapping_entry().
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> > Reported-by: Jan Kara <jack@suse.cz>
> > Cc: <stable@vger.kernel.org>    [4.10+]
> 
> Just as a side note - we wouldn't really have to unmap the mapping range
> covered by the DAX exceptional entry. It would be enough to find out
> whether such range is mapped and bail out in that case. But that would
> still be pretty expensive for DAX - we'd have to do rmap walk similar as in
> dax_mapping_entry_mkclean() and IMHO it is not worth it. So I agree with
> what you did. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Yep, that makes sense.  Thanks for the review.

Thread overview: 17+ messages
[not found] <20170420191446.GA21694@linux.intel.com>
2017-04-21 3:44 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
2017-04-21 3:44 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
2017-04-25 11:10 ` Jan Kara
2017-04-25 22:59 ` Ross Zwisler
2017-04-26 8:52 ` Jan Kara
2017-04-26 22:52 ` Ross Zwisler
2017-04-27 7:26 ` Jan Kara
2017-05-01 22:38 ` Ross Zwisler
2017-05-04 9:12 ` Jan Kara
2017-05-01 22:59 ` Dan Williams
2017-04-24 17:49 ` [PATCH 1/2] xfs: fix incorrect argument count check Ross Zwisler
2017-04-24 17:49 ` [PATCH 2/2] dax: add regression test for stale mmap reads Ross Zwisler
2017-04-25 11:27 ` Eryu Guan
2017-04-25 20:39 ` Ross Zwisler
2017-04-26 3:42 ` Eryu Guan
2017-04-25 10:10 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
2017-05-01 16:54 ` Ross Zwisler