* [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking @ 2015-11-17 20:15 Dan Williams 2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams ` (7 more replies) 0 siblings, 8 replies; 20+ messages in thread From: Dan Williams @ 2015-11-17 20:15 UTC (permalink / raw) To: linux-nvdimm Cc: Jan Kara, Boaz Harrosh, Theodore Ts'o, Dave Hansen, Dave Chinner, Yigal Korman, stable, Jens Axboe, Jeff Moyer, linux-block, Matthew Wilcox, Dave Chinner, linux-fsdevel, willy, ross.zwisler, linux-ext4, akpm, Kirill A. Shutemov, Alexander Viro Changes since last posting [1]: 1/ Further cleanups to dax_clear_blocks(): Dropped increments of 'addr' since we call bdev_direct_access() before the next use, and dropped the BUG_ON for sector unaligned return values from bdev_direct_access(). 2/ In [PATCH 8/8] introduce blk_dax_ctl to remove the need to have separate dax_map_atomic and __dax_map_atomic routines. Note, blk_dax_ctl is not passed through to drivers, it gets unpacked in bdev_direct_access. (Willy) 3/ New [PATCH 2/8]: Disable huge page dax mappings while we resolve various crash scenarios in this development cycle. 4/ New [PATCH 4/8]: Unmap all dax mappings at block device shutdown I have kept the reviewed-by's received to date, let me know if these incremental updates invalidate that review. [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-November/002733.html --- The first 4 patches in this series I consider 4.4-rc / -stable material. The rest are for 4.5. [PATCH 4/8] needs scrutiny. It is yet another example of where DAX behavior necessarily differs from page cache behavior. I still maintain that we should not be surprising unaware applications with DAX semantics, i.e. that DAX should be per-inode opt-in, not globally enabled for all inodes at fs mount time. The largest patch in the set, [PATCH 8/8], addresses the lifetime of the 'addr' returned by bdev_direct_access. That address is only valid while the device driver is enabled. The new dax_map_atomic() / dax_unmap_atomic() pairing guarantees that 'addr' stays valid for the duration of that mapping. While dax_map_atomic() protects against 'addr' going invalid, the new calls to truncate_pagecache() via invalidate_inodes() protect against the 'pfn' returned from bdev_direct_access() going invalid. Otherwise, the storage media can be directly accessed after the driver has been disabled. --- [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled [PATCH 2/8] dax: disable pmd mappings [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown [PATCH 5/8] pmem, dax: clean up clear_pmem() [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access() [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() arch/x86/include/asm/pmem.h | 7 - block/blk.h | 2 fs/Kconfig | 6 + fs/block_dev.c | 15 +-- fs/dax.c | 228 ++++++++++++++++++++++++++----------------- fs/ext2/super.c | 2 fs/ext4/super.c | 6 + fs/inode.c | 27 +++++ include/linux/blkdev.h | 19 +++- mm/memory.c | 8 +- mm/truncate.c | 13 ++ 11 files changed, 217 insertions(+), 116 deletions(-) ^ permalink raw reply [flat|nested] 20+ messages in thread
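As a sketch of the map/unmap pairing the cover letter describes (the struct layout, return conventions, and the helper below are illustrative assumptions based on this posting, not the literal [PATCH 8/8] code):

    struct blk_dax_ctl {
        sector_t sector;        /* input: device-relative sector */
        long size;              /* input: number of bytes requested */
        void __pmem *addr;      /* output: valid only until dax_unmap_atomic() */
        unsigned long pfn;      /* output: page frame backing @addr */
    };

    /* Zero a device range while the mapping is pinned (hypothetical caller). */
    static long dax_zero_sketch(struct block_device *bdev, sector_t sector,
                size_t len)
    {
        struct blk_dax_ctl dax = { .sector = sector, .size = len };
        long rc;

        rc = dax_map_atomic(bdev, &dax);        /* fails once the driver is disabled */
        if (rc < 0)
            return rc;                          /* no stale 'addr' is ever handed out */
        clear_pmem(dax.addr, min_t(long, rc, len));
        wmb_pmem();
        dax_unmap_atomic(bdev, &dax);           /* 'addr' must not be used past here */
        return rc;
    }

The truncate/unmap work in [PATCH 4/8] covers the other half of the lifetime problem: the 'pfn' handed to userspace mappings, which dax_map_atomic() cannot pin.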
* [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled 2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams @ 2015-11-17 20:15 ` Dan Williams 2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams ` (6 subsequent siblings) 7 siblings, 0 replies; 20+ messages in thread From: Dan Williams @ 2015-11-17 20:15 UTC (permalink / raw) To: linux-nvdimm Cc: Theodore Ts'o, Dave Chinner, stable, linux-block, Jan Kara, linux-fsdevel, willy, ross.zwisler, linux-ext4, akpm, Kirill A. Shutemov Similar to XFS, warn when mounting with DAX while it is still considered under development. Also, aspects of the DAX implementation, for example synchronization against multiple faults and faults causing block allocation, depend on correct support in the filesystem. The maturity of a given DAX implementation is filesystem specific. Cc: <stable@vger.kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: linux-ext4@vger.kernel.org Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Chinner <david@fromorbit.com> Acked-by: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/ext2/super.c | 2 ++ fs/ext4/super.c | 6 +++++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 3a71cea68420..748d35afc902 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -569,6 +569,8 @@ static int parse_options(char *options, struct super_block *sb) /* Fall through */ case Opt_dax: #ifdef CONFIG_FS_DAX + ext2_msg(sb, KERN_WARNING, + "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); set_opt(sbi->s_mount_opt, DAX); #else ext2_msg(sb, KERN_INFO, "dax option not supported"); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 753f4e68b820..c9ab67da6e5a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1664,8 +1664,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token, } sbi->s_jquota_fmt = m->mount_opt; #endif -#ifndef CONFIG_FS_DAX } else if (token == Opt_dax) { +#ifdef CONFIG_FS_DAX + ext4_msg(sb, KERN_WARNING, + "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); + sbi->s_mount_opt |= m->mount_opt; +#else ext4_msg(sb, KERN_INFO, "dax option not supported"); return -1; #endif ^ permalink raw reply related [flat|nested] 20+ messages in thread
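For illustration, mounting with -o dax after this patch logs something like the following (the device name is hypothetical; the message text is taken from the patch above, and ext4_msg() prefixes it with the filesystem/device tag):

    # mount -o dax /dev/pmem0 /mnt
    EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk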
* [PATCH 2/8] dax: disable pmd mappings 2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams 2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams @ 2015-11-17 20:16 ` Dan Williams 2015-11-17 20:51 ` Ross Zwisler 2015-11-17 20:16 ` [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) Dan Williams ` (5 subsequent siblings) 7 siblings, 1 reply; 20+ messages in thread From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw) To: linux-nvdimm Cc: Dave Chinner, stable, linux-block, Jan Kara, linux-fsdevel, willy, ross.zwisler, akpm, Kirill A. Shutemov While dax pmd mappings are functional in the nominal path they trigger kernel crashes in the following paths: BUG: unable to handle kernel paging request at ffffea0004098000 IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0 [..] Call Trace: [<ffffffff811f6573>] follow_page_mask+0x2d3/0x380 [<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0 [<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0 [<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0 kernel BUG at arch/x86/mm/gup.c:131! [..] Call Trace: [<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220 [<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0 BUG: unable to handle kernel paging request at ffffea0004088000 IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350 [..] Call Trace: [<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0 [<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10 [<ffffffff810a11c1>] _do_fork+0x91/0x590 All of these paths are interpreting a dax pmd mapping as a transparent huge page and making the assumption that the pfn is covered by the memmap, i.e. that the pfn has an associated struct page. PTE mappings do not suffer the same fate since they have the _PAGE_SPECIAL flag to cause the gup path to fault. We can do something similar for the PMD path, or otherwise defer pmd support for cases where a struct page is available. For now, 4.4-rc and -stable need to disable dax pmd support by default. For development the "depends on BROKEN" line can be removed from CONFIG_FS_DAX_PMD. Cc: <stable@vger.kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/Kconfig | 6 ++++++ fs/dax.c | 4 ++++ 2 files changed, 10 insertions(+) diff --git a/fs/Kconfig b/fs/Kconfig index da3f32f1a4e4..6ce72d8d1ee1 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -46,6 +46,12 @@ config FS_DAX or if unsure, say N. Saying Y will increase the size of the kernel by about 5kB. +config FS_DAX_PMD + bool + default FS_DAX + depends on FS_DAX + depends on BROKEN + endif # BLOCK # Posix ACL utility routines diff --git a/fs/dax.c b/fs/dax.c index d1e5cb7311a1..43671b68220e 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -541,6 +541,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, unsigned long pfn; int result = 0; + /* dax pmd mappings are broken wrt gup and fork */ + if (!IS_ENABLED(CONFIG_FS_DAX_PMD)) + return VM_FAULT_FALLBACK; + /* Fall back to PTEs if we're going to COW */ if (write && !(vma->vm_flags & VM_SHARED)) return VM_FAULT_FALLBACK; ^ permalink raw reply related [flat|nested] 20+ messages in thread
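For context, the _PAGE_SPECIAL escape hatch that keeps PTE mappings safe looks roughly like this in the x86 fast-gup walk (a simplified paraphrase of 4.4-era arch/x86/mm/gup.c with error unwinding and pte_unmap() elided, not the exact upstream code):

    static noinline int gup_pte_range_sketch(pmd_t pmd, unsigned long addr,
                unsigned long end, int write, struct page **pages, int *nr)
    {
        pte_t *ptep = pte_offset_map(&pmd, addr);

        do {
            pte_t pte = gup_get_pte(ptep);
            struct page *page;

            /* dax ptes are marked special: bail to the slow path,
             * which copes with pfns that have no struct page */
            if (pte_special(pte))
                return 0;
            if (!pte_present(pte) || (write && !pte_write(pte)))
                return 0;
            page = pte_page(pte);   /* safe: non-special ptes have a memmap entry */
            get_page(page);
            pages[(*nr)++] = page;
        } while (ptep++, addr += PAGE_SIZE, addr != end);

        return 1;
    }

The pmd walk had no equivalent check, so it dereferenced the (nonexistent) struct page behind a dax pfn, which is the crash signature in the traces above.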
* Re: [PATCH 2/8] dax: disable pmd mappings 2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams @ 2015-11-17 20:51 ` Ross Zwisler 0 siblings, 0 replies; 20+ messages in thread From: Ross Zwisler @ 2015-11-17 20:51 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Dave Chinner, stable, linux-block, Jan Kara, linux-fsdevel, willy, ross.zwisler, akpm, Kirill A. Shutemov On Tue, Nov 17, 2015 at 12:16:03PM -0800, Dan Williams wrote: > While dax pmd mappings are functional in the nominal path they trigger > kernel crashes in the following paths: > > BUG: unable to handle kernel paging request at ffffea0004098000 > IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0 > [..] > Call Trace: > [<ffffffff811f6573>] follow_page_mask+0x2d3/0x380 > [<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0 > [<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0 > [<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0 > > kernel BUG at arch/x86/mm/gup.c:131! > [..] > Call Trace: > [<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220 > [<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0 > > BUG: unable to handle kernel paging request at ffffea0004088000 > IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350 > [..] > Call Trace: > [<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0 > [<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10 > [<ffffffff810a11c1>] _do_fork+0x91/0x590 > > All of these paths are interpreting a dax pmd mapping as a transparent > huge page and making the assumption that the pfn is covered by the > memmap, i.e. that the pfn has an associated struct page. PTE mappings > do not suffer the same fate since they have the _PAGE_SPECIAL flag to > cause the gup path to fault. We can do something similar for the PMD > path, or otherwise defer pmd support for cases where a struct page is > available. For now, 4.4-rc and -stable need to disable dax pmd support > by default. > > For development the "depends on BROKEN" line can be removed from > CONFIG_FS_DAX_PMD. > > Cc: <stable@vger.kernel.org> > Cc: Jan Kara <jack@suse.com> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Matthew Wilcox <willy@linux.intel.com> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) 2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams 2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams 2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams @ 2015-11-17 20:16 ` Dan Williams 2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams ` (4 subsequent siblings) 7 siblings, 0 replies; 20+ messages in thread From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw) To: linux-nvdimm Cc: Boaz Harrosh, Dave Chinner, stable, linux-block, Yigal Korman, Alexander Viro, Jan Kara, linux-fsdevel, willy, ross.zwisler, akpm, Kirill A. Shutemov, Matthew Wilcox From: Yigal Korman <yigal@plexistor.com> DAX handling of COW faults has the wrong locking sequence: dax_fault does i_mmap_lock_read, while do_cow_fault does i_mmap_unlock_write. Ross's commit[1] missed a fix[2] that Kirill added to Matthew's commit[3]. The original COW locking logic was introduced by Matthew here[4]. This should be applied to v4.3 as well. [1] 0f90cc6609c7 mm, dax: fix DAX deadlocks [2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault() [3] 843172978bb9 dax: fix race between simultaneous faults [4] 2e4cdab0584f mm: allow page fault handlers to perform the COW Cc: <stable@vger.kernel.org> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- mm/memory.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index deb679c31f2a..c387430f06c3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3015,9 +3015,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, } else { /* * The fault handler has no page to lock, so it holds - * i_mmap_lock for write to protect against truncate. + * i_mmap_lock for read to protect against truncate. */ - i_mmap_unlock_write(vma->vm_file->f_mapping); + i_mmap_unlock_read(vma->vm_file->f_mapping); } goto uncharge_out; } @@ -3031,9 +3031,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, } else { /* * The fault handler has no page to lock, so it holds - * i_mmap_lock for write to protect against truncate. + * i_mmap_lock for read to protect against truncate. */ - i_mmap_unlock_write(vma->vm_file->f_mapping); + i_mmap_unlock_read(vma->vm_file->f_mapping); } return ret; uncharge_out: ^ permalink raw reply related [flat|nested] 20+ messages in thread
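For reference, the helpers involved are thin wrappers around a single rw_semaphore (as defined in include/linux/fs.h of this era), which is why the unlock flavor must match the lock taken in the DAX fault path:

    static inline void i_mmap_lock_read(struct address_space *mapping)
    {
        down_read(&mapping->i_mmap_rwsem);      /* what the dax fault path takes */
    }

    static inline void i_mmap_unlock_read(struct address_space *mapping)
    {
        up_read(&mapping->i_mmap_rwsem);        /* the matching release */
    }

    static inline void i_mmap_unlock_write(struct address_space *mapping)
    {
        up_write(&mapping->i_mmap_rwsem);       /* what do_cow_fault() wrongly called */
    }

Releasing a read-held rwsem with up_write() corrupts the semaphore state, which is how the mismatch manifests as the reported deadlocks.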
* [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams ` (2 preceding siblings ...) 2015-11-17 20:16 ` [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) Dan Williams @ 2015-11-17 20:16 ` Dan Williams 2015-11-18 15:09 ` Jan Kara 2015-11-17 20:16 ` [PATCH 5/8] pmem, dax: clean up clear_pmem() Dan Williams ` (3 subsequent siblings) 7 siblings, 1 reply; 20+ messages in thread From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw) To: linux-nvdimm Cc: Dave Chinner, stable, linux-block, Jan Kara, linux-fsdevel, willy, ross.zwisler, akpm Currently dax mappings survive block_device shutdown. While page cache pages are permitted to be read/written after the block_device is torn down this is not acceptable in the dax case as all media access must end when the device is disabled. The pfn backing a dax mapping is permitted to be invalidated after bdev shutdown and this is indeed the case with brd. When a dax capable block_device driver calls del_gendisk() in its shutdown path, or a filesystem evicts an inode it needs to ensure that all the pfns that had been mapped via bdev_direct_access() are unmapped. This is different than the pagecache backed case where truncate_inode_pages() is sufficient to end I/O to pages mapped to a dying inode. Since dax bypasses the page cache we need to unmap in addition to truncating pages. Also, since dax mappings are not accounted in the mapping radix we unconditionally truncate all inodes with the S_DAX flag. Likely when we add support for dynamic dax enable/disable control we'll have infrastructure to detect if the inode is unmapped and can skip the truncate. Cc: <stable@vger.kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/inode.c | 27 +++++++++++++++++++++++++++ mm/truncate.c | 13 +++++++++++-- 2 files changed, 38 insertions(+), 2 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 1be5f9003eb3..1029e033e991 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -579,6 +579,18 @@ static void dispose_list(struct list_head *head) } } +static void truncate_list(struct list_head *head) +{ + struct inode *inode, *_i; + + list_for_each_entry_safe(inode, _i, head, i_lru) { + list_del_init(&inode->i_lru); + truncate_pagecache(inode, 0); + iput(inode); + cond_resched(); + } +} + /** * evict_inodes - evict all evictable inodes for a superblock * @sb: superblock to operate on @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) int busy = 0; struct inode *inode, *next; LIST_HEAD(dispose); + LIST_HEAD(truncate); spin_lock(&sb->s_inode_list_lock); list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) busy = 1; continue; } + if (IS_DAX(inode) && atomic_read(&inode->i_count)) { + /* + * dax mappings can't live past this invalidation event + * as there is no page cache present to allow the data + * to remain accessible.
+ */ + __iget(inode); + inode_lru_list_del(inode); + spin_unlock(&inode->i_lock); + list_add(&inode->i_lru, &truncate); + busy = 1; + continue; + } if (atomic_read(&inode->i_count)) { spin_unlock(&inode->i_lock); busy = 1; @@ -669,6 +695,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) spin_unlock(&sb->s_inode_list_lock); dispose_list(&dispose); + truncate_list(&truncate); return busy; } diff --git a/mm/truncate.c b/mm/truncate.c index 76e35ad97102..ff1fb3b0980e 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -402,6 +402,7 @@ EXPORT_SYMBOL(truncate_inode_pages); */ void truncate_inode_pages_final(struct address_space *mapping) { + struct inode *inode = mapping->host; unsigned long nrshadows; unsigned long nrpages; @@ -423,7 +424,7 @@ void truncate_inode_pages_final(struct address_space *mapping) smp_rmb(); nrshadows = mapping->nrshadows; - if (nrpages || nrshadows) { + if (nrpages || nrshadows || IS_DAX(inode)) { /* * As truncation uses a lockless tree lookup, cycle * the tree lock to make sure any ongoing tree @@ -433,7 +434,15 @@ void truncate_inode_pages_final(struct address_space *mapping) spin_lock_irq(&mapping->tree_lock); spin_unlock_irq(&mapping->tree_lock); - truncate_inode_pages(mapping, 0); + /* + * In the case of DAX we also need to unmap the inode + * since the pfn backing the mapping may be invalidated + * after this returns + */ + if (IS_DAX(inode)) + truncate_pagecache(inode, 0); + else + truncate_inode_pages(mapping, 0); } } EXPORT_SYMBOL(truncate_inode_pages_final); ^ permalink raw reply related [flat|nested] 20+ messages in thread
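For context on the truncate_pagecache() versus truncate_inode_pages() distinction the patch relies on: truncate_pagecache() also zaps user mappings. Paraphrased from mm/truncate.c of this era (comments abridged):

    void truncate_pagecache(struct inode *inode, loff_t newsize)
    {
        struct address_space *mapping = inode->i_mapping;
        loff_t holebegin = round_up(newsize, PAGE_SIZE);

        /*
         * Unmap twice: once up front so truncate_inode_pages() does
         * fewer single-page unmaps, and again afterwards for
         * correctness, to catch private COW pages instantiated while
         * the truncate was in flight.
         */
        unmap_mapping_range(mapping, holebegin, 0, 1);
        truncate_inode_pages(mapping, newsize);
        unmap_mapping_range(mapping, holebegin, 0, 1);
    }

With newsize == 0 this drops every page and every user mapping of the inode, which is exactly the property the DAX case needs.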
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams @ 2015-11-18 15:09 ` Jan Kara 2015-11-19 0:22 ` Williams, Dan J 0 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2015-11-18 15:09 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Dave Chinner, stable, linux-block, Jan Kara, linux-fsdevel, willy, ross.zwisler, akpm On Tue 17-11-15 12:16:14, Dan Williams wrote: > Currently dax mappings survive block_device shutdown. While page cache > pages are permitted to be read/written after the block_device is torn > down this is not acceptable in the dax case as all media access must end > when the device is disabled. The pfn backing a dax mapping is permitted > to be invalidated after bdev shutdown and this is indeed the case with > brd. > > When a dax capable block_device driver calls del_gendisk() in its > shutdown path, or a filesystem evicts an inode it needs to ensure that > all the pfns that had been mapped via bdev_direct_access() are unmapped. > This is different than the pagecache backed case where > truncate_inode_pages() is sufficient to end I/O to pages mapped to a > dying inode. > > Since dax bypasses the page cache we need to unmap in addition to > truncating pages. Also, since dax mappings are not accounted in the > mapping radix we unconditionally truncate all inodes with the S_DAX flag. > Likely when we add support for dynamic dax enable/disable control we'll > have infrastructure to detect if the inode is unmapped and can skip the > truncate. > > Cc: <stable@vger.kernel.org> > Cc: Jan Kara <jack@suse.com> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Matthew Wilcox <willy@linux.intel.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > fs/inode.c | 27 +++++++++++++++++++++++++++ > mm/truncate.c | 13 +++++++++++-- > 2 files changed, 38 insertions(+), 2 deletions(-) ... > @@ -433,7 +434,15 @@ void truncate_inode_pages_final(struct address_space *mapping) > spin_lock_irq(&mapping->tree_lock); > spin_unlock_irq(&mapping->tree_lock); > > - truncate_inode_pages(mapping, 0); > + /* > + * In the case of DAX we also need to unmap the inode > + * since the pfn backing the mapping may be invalidated > + * after this returns > + */ > + if (IS_DAX(inode)) > + truncate_pagecache(inode, 0); > + else > + truncate_inode_pages(mapping, 0); > } Hum, I don't get this. truncate_inode_pages_final() gets called when inode has no more users. So there are no mappings of the inode. So how could truncate_pagecache() possibly make a difference? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-18 15:09 ` Jan Kara @ 2015-11-19 0:22 ` Williams, Dan J 2015-11-19 12:55 ` Jan Kara 0 siblings, 1 reply; 20+ messages in thread From: Williams, Dan J @ 2015-11-19 0:22 UTC (permalink / raw) To: jack@suse.cz Cc: linux-block@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com, jack@suse.com, david@fromorbit.com On Wed, 2015-11-18 at 16:09 +0100, Jan Kara wrote: > Hum, I don't get this. truncate_inode_pages_final() gets called when inode > has no more users. So there are no mappings of the inode. So how could > truncate_pagecache() possibly make a difference? True. I confirmed with more focused testing that the change to truncate_inode_pages_final() is not necessary. After invalidate_inodes() does unmap_mapping_range() we are protected by future calls to get_block() and blk_queue_enter() failing when there are attempts to re-establish a mapping after the block device has been torn down. Here's a revised patch. Note that the call to truncate_pagecache() is replaced with a call to unmap_mapping_range() since it is fine to access zero pages that might still be in the page cache. 8<---- Subject: mm, dax: unmap dax mappings at bdev or fs shutdown From: Dan Williams <dan.j.williams@intel.com> Currently dax mappings leak past / survive block_device shutdown. While page cache pages are permitted to be read/written after the block_device is torn down this is not acceptable in the dax case as all media access must end when the device is disabled. The pfn backing a dax mapping is permitted to be invalidated after bdev shutdown and this is indeed the case with brd. When a dax capable block_device driver calls del_gendisk() in its shutdown path, del_gendisk() needs to ensure that all DAX pfns are unmapped. This is different than the pagecache backed case where the disk is protected by the queue being torn down which ends I/O to the device. Since dax bypasses the page cache we need to unconditionally unmap the inode.
Cc: <stable@vger.kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> [honza: drop changes to truncate_inode_pages_final] Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/inode.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/fs/inode.c b/fs/inode.c index 1be5f9003eb3..dcb31d2c15e6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -579,6 +579,18 @@ static void dispose_list(struct list_head *head) } } +static void unmap_list(struct list_head *head) +{ + struct inode *inode, *_i; + + list_for_each_entry_safe(inode, _i, head, i_lru) { + list_del_init(&inode->i_lru); + unmap_mapping_range(&inode->i_data, 0, 0, 1); + iput(inode); + cond_resched(); + } +} + /** * evict_inodes - evict all evictable inodes for a superblock * @sb: superblock to operate on @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) int busy = 0; struct inode *inode, *next; LIST_HEAD(dispose); + LIST_HEAD(unmap); spin_lock(&sb->s_inode_list_lock); list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) busy = 1; continue; } + if (IS_DAX(inode) && atomic_read(&inode->i_count)) { + /* + * dax mappings can't live past this invalidation event + * as there is no page cache present to allow the data + * to remain accessible. + */ + __iget(inode); + inode_lru_list_del(inode); + spin_unlock(&inode->i_lock); + list_add(&inode->i_lru, &unmap); + busy = 1; + continue; + } if (atomic_read(&inode->i_count)) { spin_unlock(&inode->i_lock); busy = 1; @@ -669,6 +695,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) spin_unlock(&sb->s_inode_list_lock); dispose_list(&dispose); + unmap_list(&unmap); return busy; } ^ permalink raw reply related [flat|nested] 20+ messages in thread
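For reference, the argument semantics of the unmap_mapping_range() call used in unmap_list() above (the wrapper name below is mine, for illustration only):

    /* holebegin == 0 and holelen == 0 mean "from offset zero to EOF";
     * even_cows == 1 also zaps private COW copies, so no pfn obtained
     * via bdev_direct_access() stays reachable from userspace. */
    static void dax_unmap_inode_sketch(struct inode *inode)
    {
        unmap_mapping_range(&inode->i_data, 0, 0, 1);
    }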
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-19 0:22 ` Williams, Dan J @ 2015-11-19 12:55 ` Jan Kara 2015-11-19 16:55 ` Dan Williams 0 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2015-11-19 12:55 UTC (permalink / raw) To: Williams, Dan J Cc: jack@suse.cz, linux-block@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com, jack@suse.com, david@fromorbit.com > Subject: mm, dax: unmap dax mappings at bdev or fs shutdown > From: Dan Williams <dan.j.williams@intel.com> > > Currently dax mappings leak past / survive block_device shutdown. While > page cache pages are permitted to be read/written after the block_device > is torn down this is not acceptable in the dax case as all media access > must end when the device is disabled. The pfn backing a dax mapping is > permitted to be invalidated after bdev shutdown and this is indeed the > case with brd. > > When a dax capable block_device driver calls del_gendisk() in its > shutdown path, del_gendisk() needs to ensure that all DAX pfns are > unmapped. This is different than the pagecache backed case where the > disk is protected by the queue being torn down which ends I/O to the > device. Since dax bypasses the page cache we need to unconditionally > unmap the inode. ... > +static void unmap_list(struct list_head *head) > +{ > + struct inode *inode, *_i; > + > + list_for_each_entry_safe(inode, _i, head, i_lru) { > + list_del_init(&inode->i_lru); > + unmap_mapping_range(&inode->i_data, 0, 0, 1); > + iput(inode); > + cond_resched(); > + } > +} > + > /** > * evict_inodes - evict all evictable inodes for a superblock > * @sb: superblock to operate on > @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) > int busy = 0; > struct inode *inode, *next; > LIST_HEAD(dispose); > + LIST_HEAD(unmap); > > spin_lock(&sb->s_inode_list_lock); > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) > busy = 1; > continue; > } > + if (IS_DAX(inode) && atomic_read(&inode->i_count)) { > + /* > + * dax mappings can't live past this invalidation event > + * as there is no page cache present to allow the data > + * to remain accessible. > + */ > + __iget(inode); > + inode_lru_list_del(inode); > + spin_unlock(&inode->i_lock); > + list_add(&inode->i_lru, &unmap); > + busy = 1; > + continue; > + } I'm somewhat concerned about the abuse of i_lru list here. The inodes have i_count > 0 and so the code generally assumes such inodes should be removed from LRU list if they are there. Now I didn't find a place where this could really hit you but it is fragile. And when that happens, you have a corruption of your unmap list (since you access it without any locks) and also you will not unmap your inode. Also are you sure that your unmapping cannot race with other processes mapping the pfns again? BTW what happens when you access a DAX pfn that got removed? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-19 12:55 ` Jan Kara @ 2015-11-19 16:55 ` Dan Williams 2015-11-19 17:12 ` Jan Kara 2015-11-19 23:17 ` Dave Chinner 0 siblings, 2 replies; 20+ messages in thread From: Dan Williams @ 2015-11-19 16:55 UTC (permalink / raw) To: Jan Kara Cc: linux-block@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com, jack@suse.com, david@fromorbit.com On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote: >> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown >> From: Dan Williams <dan.j.williams@intel.com> >> >> Currently dax mappings leak past / survive block_device shutdown. While >> page cache pages are permitted to be read/written after the block_device >> is torn down this is not acceptable in the dax case as all media access >> must end when the device is disabled. The pfn backing a dax mapping is >> permitted to be invalidated after bdev shutdown and this is indeed the >> case with brd. >> >> When a dax capable block_device driver calls del_gendisk() in its >> shutdown path, del_gendisk() needs to ensure that all DAX pfns are >> unmapped. This is different than the pagecache backed case where the >> disk is protected by the queue being torn down which ends I/O to the >> device. Since dax bypasses the page cache we need to unconditionally >> unmap the inode. > ... >> +static void unmap_list(struct list_head *head) >> +{ >> + struct inode *inode, *_i; >> + >> + list_for_each_entry_safe(inode, _i, head, i_lru) { >> + list_del_init(&inode->i_lru); >> + unmap_mapping_range(&inode->i_data, 0, 0, 1); >> + iput(inode); >> + cond_resched(); >> + } >> +} >> + >> /** >> * evict_inodes - evict all evictable inodes for a superblock >> * @sb: superblock to operate on >> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) >> int busy = 0; >> struct inode *inode, *next; >> LIST_HEAD(dispose); >> + LIST_HEAD(unmap); >> >> spin_lock(&sb->s_inode_list_lock); >> list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { >> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) >> busy = 1; >> continue; >> } >> + if (IS_DAX(inode) && atomic_read(&inode->i_count)) { >> + /* >> + * dax mappings can't live past this invalidation event >> + * as there is no page cache present to allow the data >> + * to remain accessible. >> + */ >> + __iget(inode); >> + inode_lru_list_del(inode); >> + spin_unlock(&inode->i_lock); >> + list_add(&inode->i_lru, &unmap); >> + busy = 1; >> + continue; >> + } > > I'm somewhat concerned about the abuse of i_lru list here. The inodes have > i_count > 0 and so the code generally assumes such inodes should be removed > from LRU list if they are there. Now I didn't find a place where this could > really hit you but it is fragile. And when that happens, you have a > corruption of your unmap list (since you access it without any locks) and > also you will not unmap your inode. "i_lru" list abuse was my main concern with this patch, I'll look into a different way. > Also are you sure that your unmapping cannot race with other processes > mapping the pfns again? You're right, there is indeed a gap between the unmap and when get_blocks() will start returning errors in the fault path. I think I need to defer the unmapping until after blk_cleanup_queue(), where we know that no new I/O to the device is possible.
> BTW what happens when you access a DAX pfn that got removed? SIGBUS. I don't see a way to be any kinder to the application. After the ->remove() method for the block_device is complete we can't be sure that the pfn is valid or even present in the system (think brd, or VM hot provisioning). ^ permalink raw reply [flat|nested] 20+ messages in thread
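The blk_queue_enter() gatekeeping referred to in this exchange works roughly as follows (a sketch assuming the 4.4-era two-argument blk_queue_enter() and the blk_dax_ctl form of bdev_direct_access() described in the cover letter; not the literal [PATCH 8/8] body):

    static long dax_map_atomic_sketch(struct block_device *bdev,
                struct blk_dax_ctl *dax)
    {
        struct request_queue *q = bdev->bd_queue;
        long rc;

        /* Fails as soon as blk_cleanup_queue() has marked the queue
         * dying, closing the window between the unmap and get_blocks()
         * starting to return errors. */
        if (blk_queue_enter(q, GFP_NOWAIT) != 0)
            return -ENXIO;

        rc = bdev_direct_access(bdev, dax);
        if (rc < 0)
            blk_queue_exit(q);      /* drop the queue reference on failure */
        return rc;
    }

    static void dax_unmap_atomic_sketch(struct block_device *bdev)
    {
        blk_queue_exit(bdev->bd_queue); /* re-allow queue teardown to proceed */
    }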
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-19 16:55 ` Dan Williams @ 2015-11-19 17:12 ` Jan Kara 2015-11-19 23:17 ` Dave Chinner 1 sibling, 0 replies; 20+ messages in thread From: Jan Kara @ 2015-11-19 17:12 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, linux-block@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com, jack@suse.com, david@fromorbit.com On Thu 19-11-15 08:55:48, Dan Williams wrote: > On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote: > > Also are you sure that your unmapping cannot race with other process > > mapping the pfns again? > > You're right, there is indeed a gap between the unmap and when > get_blocks() will start returning errors in the fault path. I think I > need to defer the unmapping until after blk_cleanup_queue() where we > know that no new I/o to the device is possible. Yeah, you need to squeeze it somewhere after the point where get_blocks() start returning errors and before the point where pfn can go away. > > BTW what happens when you access a DAX pfn that got removed? > > SIGBUS. I don't see a way to be any kinder to the application. After > the ->remove() method for the block_device is complete we can't be > sure that the pfn is valid or even present in the system (think brd, > or VM hot provisioning). I see. So if we kept the PFN mapped, it could result e.g. in memory corruption (at least in case of brd). So we really need this to be 100% reliable. That's what I was interested in. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-19 16:55 ` Dan Williams 2015-11-19 17:12 ` Jan Kara @ 2015-11-19 23:17 ` Dave Chinner 2015-11-20 0:05 ` Williams, Dan J 1 sibling, 1 reply; 20+ messages in thread From: Dave Chinner @ 2015-11-19 23:17 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, linux-block@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com, jack@suse.com On Thu, Nov 19, 2015 at 08:55:48AM -0800, Dan Williams wrote: > On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote: > >> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown > >> From: Dan Williams <dan.j.williams@intel.com> > >> > >> Currently dax mappings leak past / survive block_device shutdown. While > >> page cache pages are permitted to be read/written after the block_device > >> is torn down this is not acceptable in the dax case as all media access > >> must end when the device is disabled. The pfn backing a dax mapping is > >> permitted to be invalidated after bdev shutdown and this is indeed the > >> case with brd. > >> > >> When a dax capable block_device driver calls del_gendisk() in its > >> shutdown path, del_gendisk() needs to ensure that all DAX pfns are > >> unmapped. This is different than the pagecache backed case where the > >> disk is protected by the queue being torn down which ends I/O to the > >> device. Since dax bypasses the page cache we need to unconditionally > >> unmap the inode. > > ... > >> +static void unmap_list(struct list_head *head) > >> +{ > >> + struct inode *inode, *_i; > >> + > >> + list_for_each_entry_safe(inode, _i, head, i_lru) { > >> + list_del_init(&inode->i_lru); > >> + unmap_mapping_range(&inode->i_data, 0, 0, 1); > >> + iput(inode); > >> + cond_resched(); > >> + } > >> +} > >> + > >> /** > >> * evict_inodes - evict all evictable inodes for a superblock > >> * @sb: superblock to operate on > >> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) > >> int busy = 0; > >> struct inode *inode, *next; > >> LIST_HEAD(dispose); > >> + LIST_HEAD(unmap); > >> > >> spin_lock(&sb->s_inode_list_lock); > >> list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > >> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) > >> busy = 1; > >> continue; > >> } > >> + if (IS_DAX(inode) && atomic_read(&inode->i_count)) { > >> + /* > >> + * dax mappings can't live past this invalidation event > >> + * as there is no page cache present to allow the data > >> + * to remain accessible. > >> + */ > >> + __iget(inode); > >> + inode_lru_list_del(inode); > >> + spin_unlock(&inode->i_lock); > >> + list_add(&inode->i_lru, &unmap); > >> + busy = 1; > >> + continue; > >> + } > > > > I'm somewhat concerned about the abuse of i_lru list here. The inodes have > > i_count > 0 and so the code generally assumes such inodes should be removed > > from LRU list if they are there. Now I didn't find a place where this could > > really hit you but it is fragile. And when that happens, you have a > > corruption of your unmap list (since you access it without any locks) and > > also you will not unmap your inode. > > "i_lru" list abuse was my main concern with this patch, I'll look into > a different way.
Yeah, you can't use i_lru at all for purposes like this - even if there are active references to the inode, the shrinker could be walking the LRU list and accessing this inode (e.g. inode_lru_isolate()) at the same time this code is removing it from the LRU. Then things will go real bad.... > > Also are you sure that your unmapping cannot race with other process > > mapping the pfns again? > > You're right, there is indeed a gap between the unmap and when > get_blocks() will start returning errors in the fault path. get_blocks() won't start returning errors until the filesystem has had an IO error. Given they cache metadata and do async transactions, it could be some time before the filesystem notices that the device has gone away in a fatal way. > I think I > need to defer the unmapping until after blk_cleanup_queue() where we > know that no new I/o to the device is possible. Actually, I think we need to trigger a filesystem shutdown before doing anything else (e.g. before unmapping the inodes). That way the filesystem will error out new calls to get_blocks() and prevent any new IO submission while the block device teardown and inode unmapping is done. i.e. solving the problem at the block device level is hard, but we already have all the necessary infrastructure to shut off IO and new block mappings at the filesystem level.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-19 23:17 ` Dave Chinner @ 2015-11-20 0:05 ` Williams, Dan J 2015-11-20 4:06 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Williams, Dan J @ 2015-11-20 0:05 UTC (permalink / raw) To: david@fromorbit.com Cc: linux-block@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com, jack@suse.com, jack@suse.cz On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote: > Actually, I think we need to trigger a filesystem shutdown before > doing anything else (e.g. before unmapping the inodes). That way the > filesystem will error out new calls to get_blocks() and prevent any > new IO submission while the block device teardown and inode > unmapping is done. i.e. solving the problem at the block device > level is hard, but we already have all the necessary infrastructure > to shut off IO and new block mappings at the filesystem level.... > Shutting down the filesystem on block_device remove seems a more invasive behavior change from what we've historically done. I.e. a best effort "let the fs struggle on to service whatever it can that is not dependent on new disk I/O". Solving it at the block layer isn't that hard now that we have blk_queue_enter() to cheaply test if the block_device is dying or dead. Below is an updated attempt that borrows a common inode walking pattern, and simply reorders the operations done by del_gendisk and blk_cleanup_queue. Note that this depends on dax_map_atomic() from "[PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()" (https://lists.01.org/pipermail/linux-nvdimm/2015-November/002882.html), where dax_map_atomic() checks that the queue is still live before allowing new calls to bdev_direct_access(). 8<----- Subject: mm, dax: unmap dax mappings at bdev shutdown From: Dan Williams <dan.j.williams@intel.com> Currently dax mappings leak past / survive block_device shutdown. While page cache pages are permitted to be read/written after the block_device is torn down this is not acceptable in the dax case as all media access must end when the device is disabled. The pfn backing a dax mapping is permitted to be invalidated after bdev shutdown and this is indeed the case with brd. When a dax capable block_device driver calls del_gendisk_queue() in its shutdown path, it needs to ensure that all DAX pfns are unmapped, and that no new mappings can be established. This is different than the pagecache backed case where the disk is protected by the queue being torn down which ends I/O to the device. Since dax bypasses the page cache we need to unconditionally unmap the inode.
Cc: Jan Kara <jack@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> [honza: drop changes to truncate_inode_pages_final] [honza: ensure mappings can't be re-established] Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- block/genhd.c | 88 +++++++++++++++++++++++++++++++++++------- drivers/block/brd.c | 3 - drivers/nvdimm/pmem.c | 3 - drivers/s390/block/dcssblk.c | 6 +-- fs/block_dev.c | 41 ++++++++++++++++++++ include/linux/fs.h | 1 include/linux/genhd.h | 1 7 files changed, 121 insertions(+), 22 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index e5cafa51567c..37ab780c0937 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -634,24 +634,14 @@ void add_disk(struct gendisk *disk) } EXPORT_SYMBOL(add_disk); -void del_gendisk(struct gendisk *disk) +static void del_gendisk_start(struct gendisk *disk) { - struct disk_part_iter piter; - struct hd_struct *part; - blk_integrity_del(disk); disk_del_events(disk); +} - /* invalidate stuff */ - disk_part_iter_init(&piter, disk, - DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); - while ((part = disk_part_iter_next(&piter))) { - invalidate_partition(disk, part->partno); - delete_partition(disk, part->partno); - } - disk_part_iter_exit(&piter); - - invalidate_partition(disk, 0); +static void del_gendisk_end(struct gendisk *disk) +{ set_capacity(disk, 0); disk->flags &= ~GENHD_FL_UP; @@ -670,9 +660,79 @@ void del_gendisk(struct gendisk *disk) pm_runtime_set_memalloc_noio(disk_to_dev(disk), false); device_del(disk_to_dev(disk)); } + +#define for_each_part(part, piter) \ + for (part = disk_part_iter_next(piter); part; \ + part = disk_part_iter_next(piter)) +void del_gendisk(struct gendisk *disk) +{ + struct disk_part_iter piter; + struct hd_struct *part; + + del_gendisk_start(disk); + + /* invalidate stuff */ + disk_part_iter_init(&piter, disk, + DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); + for_each_part(part, &piter) { + invalidate_partition(disk, part->partno); + delete_partition(disk, part->partno); + } + disk_part_iter_exit(&piter); + invalidate_partition(disk, 0); + + del_gendisk_end(disk); +} EXPORT_SYMBOL(del_gendisk); /** + * del_gendisk_queue - combined del_gendisk + blk_cleanup_queue + * @disk: disk to delete, invalidate, unmap, and end i/o + * + * This alternative for open-coded calls to: + * del_gendisk() + * blk_cleanup_queue() + * ...is for ->direct_access() (DAX) capable devices. DAX bypasses page + * cache and mappings go directly to storage media. When such a disk is + * removed the pfn backing a mapping may be invalid or removed from the + * system. Upon return accessing DAX mappings of this disk will trigger + * SIGBUS.
+ */ +void del_gendisk_queue(struct gendisk *disk) +{ + struct disk_part_iter piter; + struct hd_struct *part; + + del_gendisk_start(disk); + + /* pass1 sync fs + evict idle inodes */ + disk_part_iter_init(&piter, disk, + DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); + for_each_part(part, &piter) + invalidate_partition(disk, part->partno); + disk_part_iter_exit(&piter); + invalidate_partition(disk, 0); + + blk_cleanup_queue(disk->queue); + + /* + * pass2 now that the queue is dead unmap DAX inodes, and delete + * partitions + */ + disk_part_iter_init(&piter, disk, + DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); + for_each_part(part, &piter) { + unmap_partition(disk, part->partno); + delete_partition(disk, part->partno); + } + disk_part_iter_exit(&piter); + unmap_partition(disk, 0); + + del_gendisk_end(disk); +} +EXPORT_SYMBOL(del_gendisk_queue); + +/** * get_gendisk - get partitioning information for a given device * @devt: device to get partitioning information for * @partno: returned partition index diff --git a/drivers/block/brd.c b/drivers/block/brd.c index a5880f4ab40e..f149dd57bba3 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -532,7 +532,6 @@ out: static void brd_free(struct brd_device *brd) { put_disk(brd->brd_disk); - blk_cleanup_queue(brd->brd_queue); brd_free_pages(brd); kfree(brd); } @@ -560,7 +559,7 @@ out: static void brd_del_one(struct brd_device *brd) { list_del(&brd->brd_list); - del_gendisk(brd->brd_disk); + del_gendisk_queue(brd->brd_disk); brd_free(brd); } diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 8ee79893d2f5..6dd06e9d34b0 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -158,9 +158,8 @@ static void pmem_detach_disk(struct pmem_device *pmem) if (!pmem->pmem_disk) return; - del_gendisk(pmem->pmem_disk); + del_gendisk_queue(pmem->pmem_disk); put_disk(pmem->pmem_disk); - blk_cleanup_queue(pmem->pmem_queue); } static int pmem_attach_disk(struct device *dev, diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 94a8f4ab57bc..0c3c968b57d9 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -388,8 +388,7 @@ removeseg: } list_del(&dev_info->lh); - del_gendisk(dev_info->gd); - blk_cleanup_queue(dev_info->dcssblk_queue); + del_gendisk_queue(dev_info->gd); dev_info->gd->queue = NULL; put_disk(dev_info->gd); up_write(&dcssblk_devices_sem); @@ -751,8 +750,7 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch } list_del(&dev_info->lh); - del_gendisk(dev_info->gd); - blk_cleanup_queue(dev_info->dcssblk_queue); + del_gendisk_queue(dev_info->gd); dev_info->gd->queue = NULL; put_disk(dev_info->gd); device_unregister(&dev_info->dev); diff --git a/fs/block_dev.c b/fs/block_dev.c index 4c14d4467bbf..975d32b5623b 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1795,6 +1795,47 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty) } EXPORT_SYMBOL(__invalidate_device); +void unmap_partition(struct gendisk *disk, int partno) +{ + bool dax = !!disk->fops->direct_access; + struct block_device *bdev = dax ?
bdget_disk(disk, partno) : NULL; + struct super_block *sb = get_super(bdev); + struct inode *inode, *_inode = NULL; + + if (!bdev) + return; + + if (!sb) { + bdput(bdev); + return; + } + + spin_lock(&sb->s_inode_list_lock); + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { + spin_lock(&inode->i_lock); + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) + || !IS_DAX(inode)) { + spin_unlock(&inode->i_lock); + continue; + } + __iget(inode); + spin_unlock(&inode->i_lock); + spin_unlock(&sb->s_inode_list_lock); + + unmap_mapping_range(inode->i_mapping, 0, 0, 1); + iput(_inode); + _inode = inode; + cond_resched(); + + spin_lock(&sb->s_inode_list_lock); + } + spin_unlock(&sb->s_inode_list_lock); + iput(_inode); + + drop_super(sb); + bdput(bdev); +} + void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg) { struct inode *inode, *old_inode = NULL; diff --git a/include/linux/fs.h b/include/linux/fs.h index 3aa514254161..d2dda249abf8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2390,6 +2390,7 @@ extern int revalidate_disk(struct gendisk *); extern int check_disk_change(struct block_device *); extern int __invalidate_device(struct block_device *, bool); extern int invalidate_partition(struct gendisk *, int); +extern void unmap_partition(struct gendisk *, int); #endif unsigned long invalidate_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t end); diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 847cc1d91634..028cf15a8a57 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -431,6 +431,7 @@ extern void part_round_stats(int cpu, struct hd_struct *part); /* block/genhd.c */ extern void add_disk(struct gendisk *disk); extern void del_gendisk(struct gendisk *gp); +extern void del_gendisk_queue(struct gendisk *disk); extern struct gendisk *get_gendisk(dev_t dev, int *partno); extern struct block_device *bdget_disk(struct gendisk *disk, int partno); ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-20 0:05 ` Williams, Dan J @ 2015-11-20 4:06 ` Dave Chinner 2015-11-20 4:25 ` Dan Williams 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2015-11-20 4:06 UTC (permalink / raw) To: Williams, Dan J Cc: linux-block@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, linux-nvdimm@lists.01.org, willy@linux.intel.com, linux-fsdevel@vger.kernel.org, ross.zwisler@linux.intel.com, jack@suse.com, jack@suse.cz On Fri, Nov 20, 2015 at 12:05:11AM +0000, Williams, Dan J wrote: > On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote: > > Actually, I think we need to trigger a filesystem shutdown before > > doing anything else (e.g. before unmapping the inodes). That way the > > filesystem will error out new calls to get_blocks() and prevent any > > new IO submission while the block device teardown and inode > > unmapping is done. i.e. solving the problem at the block device > > level is hard, but we already have all the necessary infrastructure > > to shut off IO and new block mappings at the filesystem level.... > > > > Shutting down the filesystem on block_device remove seems a more > invasive behavior change from what we've historically done. I've heard that so many times I figured that would be your answer. Yet we've got a clear situation where we have a race between file level access and block device operations that is clearly solved by doing an upfront filesystem shutdown on unplug, but still the answer is "ignore the filesystem, we need to do everything in the block layer, no matter how hard or complex it is to solve"... > I.e. a > best effort "let the fs struggle on to service whatever it can that is > not dependent on new disk I/O". And so we still have this limbo fs state that is an utter nightmare to handle sanely. We don't know what the cause of the IO errors is and so we have to try to handle them as though we can recover in some way from the error. Only when we get an error we can't possibly recover from do we shut the filesystem down and then stop all attempts at issuing IO, mapping page faults, etc. However, if the device has been unplugged then we *can never recover* and so continuing on with our eyes closed and fingers in our ears shouting "lalalalalalala" as loud as we can won't change the fact that we are going to shut down the filesystem in the near future. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-20 4:06 ` Dave Chinner @ 2015-11-20 4:25 ` Dan Williams 2015-11-20 17:08 ` Dan Williams 0 siblings, 1 reply; 20+ messages in thread From: Dan Williams @ 2015-11-20 4:25 UTC (permalink / raw) To: Dave Chinner Cc: jack@suse.cz, linux-nvdimm@lists.01.org, stable@vger.kernel.org, linux-block@vger.kernel.org, jack@suse.com, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org On Thu, Nov 19, 2015 at 8:06 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Nov 20, 2015 at 12:05:11AM +0000, Williams, Dan J wrote: >> On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote: >> > Actually, I think we need to trigger a filesystem shutdown before >> > doing anything else (e.g. before unmapping the inodes). That way the >> > filesystem will error out new calls to get_blocks() and prevent any >> > new IO submission while the block device teardown and inode >> > unmapping is done. i.e. solving the problem at the block device >> > level is hard, but we already have all the necessary infrastructure >> > to shut off IO and new block mappings at the filesystem level.... >> > >> >> Shutting down the filesystem on block_device remove seems a more >> invasive behavior change from what we've historically done. > > I've heard that so many times I figured that would be your answer. > Yet we've got a clear situation where we have a race between file > level access and block device operations that is clearly solved by > doing an upfront filesystem shutdown on unplug, but still the answer > is "ignore the filesystem, we need to do everything in the block > layer, no matter how hard or complex it is to solve"... I was only disagreeing with your assertion that solving this particular problem in the block layer was hard, and not asserting that this needs to be solved in the block layer at all costs. >> I.e. a >> best effort "let the fs struggle on to service whatever it can that is >> not dependent on new disk I/O". > > And so we still have this limbo fs state that is an utter nightmare > to handle sanely. We don't know what the cause of the IO errors is > and so we have to try to handle them as though we can recover in > some way from the error. Only when we get an error we can't possibly > recover from do we shut the filesystem down and then stop all > attempts at issuing IO, mapping page faults, etc. > > However, if the device has been unplugged then we *can never > recover* and so continuing on with our eyes closed and fingers in > our ears shouting "lalalalalalala" as loud as we can won't change > the fact that we are going to shut down the filesystem in the near > future. > Can't argue with that, and XFS takes a lot longer to mourn the loss of the block device than ext4. What would be the recommended interface to tell XFS to sync if it can, but give up quickly if it hits an error and then shut down permanently? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown 2015-11-20 4:25 ` Dan Williams @ 2015-11-20 17:08 ` Dan Williams 0 siblings, 0 replies; 20+ messages in thread From: Dan Williams @ 2015-11-20 17:08 UTC (permalink / raw) To: Dave Chinner Cc: jack@suse.cz, linux-nvdimm@lists.01.org, stable@vger.kernel.org, linux-block@vger.kernel.org, jack@suse.com, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org On Thu, Nov 19, 2015 at 8:25 PM, Dan Williams <dan.j.williams@intel.com> wrote: > On Thu, Nov 19, 2015 at 8:06 PM, Dave Chinner <david@fromorbit.com> wrote: [..] > What would be the recommended interface to tell XFS to sync if it can, > but give up quickly if it hits an error and then shut down permanently? Thinking about this a bit further, I think a "give up" notification to a filesystem is an incremental addition to the common vfs layer invalidate_partition() and now unmap_partition(), because "sync if you can, but give up quickly" is already handled by fsync_bdev() and generic_make_request() returning immediate errors. It's only the filesystem-triggered retries that we need to turn off. ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 5/8] pmem, dax: clean up clear_pmem()
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (3 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-17 20:16 ` [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations Dan Williams
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, linux-block, Jeff Moyer, linux-fsdevel, willy,
      ross.zwisler, akpm

Both __dax_pmd_fault() and clear_pmem() were taking special steps to
clear memory a page at a time to take advantage of non-temporal
clear_page() implementations. However, x86_64 does not use
non-temporal instructions for clear_page(), and arch_clear_pmem() was
always incurring the cost of __arch_wb_cache_pmem().

Clean up the assumption that doing clear_pmem() a page at a time is
more performant.

Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/pmem.h |    7 +------
 fs/dax.c                    |    4 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec816ab..1544fabcd7f9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -132,12 +132,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	void *vaddr = (void __force *)addr;
 
-	/* TODO: implement the zeroing via non-temporal writes */
-	if (size == PAGE_SIZE && ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-		clear_page(vaddr);
-	else
-		memset(vaddr, 0, size);
-
+	memset(vaddr, 0, size);
 	__arch_wb_cache_pmem(vaddr, size);
 }
 
diff --git a/fs/dax.c b/fs/dax.c
index 43671b68220e..19492cc65a30 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -641,9 +641,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto fallback;
 
 	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-		int i;
-		for (i = 0; i < PTRS_PER_PMD; i++)
-			clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
+		clear_pmem(kaddr, PMD_SIZE);
 		wmb_pmem();
 		count_vm_event(PGMAJFAULT);
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
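The removed TODO pointed at zeroing via non-temporal writes, which
would make the trailing __arch_wb_cache_pmem() pass unnecessary. For
orientation only, a rough sketch of that direction; this is not part
of the patch, and the movnti loop, the alignment assumptions, and the
sfence placement are all assumptions of the sketch:

#include <stddef.h>

/*
 * Sketch: zero a pmem range with non-temporal 8-byte stores so the
 * data bypasses the cache and no writeback pass is needed after.
 * Assumes vaddr is 8-byte aligned and size is a multiple of 8.
 */
static inline void clear_pmem_nt(void *vaddr, size_t size)
{
	unsigned long *p = vaddr;
	size_t i;

	for (i = 0; i < size / sizeof(*p); i++)
		asm volatile("movnti %1, %0"
			     : "=m" (p[i]) : "r" (0UL));
	/* order the non-temporal stores before later writes */
	asm volatile("sfence" ::: "memory");
}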
* [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (4 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 5/8] pmem, dax: clean up clear_pmem() Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-17 20:16 ` [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
  2015-11-17 20:16 ` [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-block, Jeff Moyer, Jan Kara, linux-fsdevel, willy,
      ross.zwisler, akpm

dax_clear_blocks() is currently performing a cond_resched() after
every PAGE_SIZE memset. We need not check so frequently; for example
md-raid only calls cond_resched() at stripe granularity. Also, in
preparation for introducing a dax_map_atomic() operation that
temporarily pins a dax mapping, move the call to cond_resched() to
the outer loop.

Reviewed-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 19492cc65a30..e11d88835bb2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/sizes.h>
 
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
@@ -43,24 +44,17 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	do {
 		void __pmem *addr;
 		unsigned long pfn;
-		long count;
+		long count, sz;
 
 		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
 		if (count < 0)
 			return count;
-		BUG_ON(size < count);
-		while (count > 0) {
-			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
-			if (pgsz > count)
-				pgsz = count;
-			clear_pmem(addr, pgsz);
-			addr += pgsz;
-			size -= pgsz;
-			count -= pgsz;
-			BUG_ON(pgsz & 511);
-			sector += pgsz / 512;
-			cond_resched();
-		}
+		sz = min_t(long, count, SZ_1M);
+		clear_pmem(addr, sz);
+		size -= sz;
+		BUG_ON(sz & 511);
+		sector += sz / 512;
+		cond_resched();
 	} while (size);
 
 	wmb_pmem();
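The idiom here generalizes: do a large operation in fixed-size chunks
and only offer to reschedule once per chunk instead of once per page.
A standalone sketch of that pattern, where CHUNK stands in for SZ_1M
and the two callbacks stand in for clear_pmem() and cond_resched()
(all names here are illustrative, not from the patch):

#include <stddef.h>

#define CHUNK (1UL << 20)	/* 1 MiB, mirroring SZ_1M */

static void clear_in_chunks(void *addr, size_t size,
			    void (*zero_fn)(void *, size_t),
			    void (*resched_fn)(void))
{
	while (size) {
		size_t sz = size < CHUNK ? size : CHUNK;

		zero_fn(addr, sz);
		addr = (char *)addr + sz;
		size -= sz;
		/* one scheduling point per megabyte, not per page */
		resched_fn();
	}
}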
* [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access()
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (5 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  2015-11-17 20:16 ` [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-block, willy, ross.zwisler, linux-fsdevel, akpm

If a ->direct_access() implementation ever returns a map count that is
not a multiple of PAGE_SIZE (including a count less than PAGE_SIZE),
catch the error in bdev_direct_access(). This simplifies error
checking in upper layers.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |    2 ++
 fs/dax.c       |    1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..e2cc681d4afe 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -475,6 +475,8 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	avail = ops->direct_access(bdev, sector, addr, pfn);
 	if (!avail)
 		return -ERANGE;
+	if (avail > 0 && avail & ~PAGE_MASK)
+		return -ENXIO;
 	return min(avail, size);
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
diff --git a/fs/dax.c b/fs/dax.c
index e11d88835bb2..6e498c2570bf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,6 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 		sz = min_t(long, count, SZ_1M);
 		clear_pmem(addr, sz);
 		size -= sz;
-		BUG_ON(sz & 511);
 		sector += sz / 512;
 		cond_resched();
 	} while (size);
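The "avail & ~PAGE_MASK" test reads backwards at first glance, so here
is a self-contained illustration of its arithmetic. With 4K pages,
~PAGE_MASK is 0xfff, so the expression is non-zero exactly when avail
is not a whole number of pages. The fixed PAGE_SIZE and the helper
name are assumptions of this example:

#include <assert.h>

#define EX_PAGE_SIZE 4096UL
#define EX_PAGE_MASK (~(EX_PAGE_SIZE - 1))

/* mirrors the check added to bdev_direct_access() above */
static int is_page_aligned_count(long avail)
{
	return !(avail > 0 && (avail & ~EX_PAGE_MASK));
}

int main(void)
{
	assert(is_page_aligned_count(4096));	/* one page: ok */
	assert(is_page_aligned_count(8192));	/* two pages: ok */
	assert(!is_page_aligned_count(512));	/* sub-page: rejected */
	assert(!is_page_aligned_count(6144));	/* 1.5 pages: rejected */
	return 0;
}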
* [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
                   ` (6 preceding siblings ...)
  2015-11-17 20:16 ` [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
@ 2015-11-17 20:16 ` Dan Williams
  7 siblings, 0 replies; 20+ messages in thread
From: Dan Williams @ 2015-11-17 20:16 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-block, Dave Chinner, Jens Axboe, Jeff Moyer, Jan Kara,
      linux-fsdevel, willy, ross.zwisler, akpm

The DAX implementation needs to protect new calls to ->direct_access(),
and usage of the addresses it returns, against the driver for the
underlying block device being disabled. Use blk_queue_enter() /
blk_queue_exit() to hold off blk_cleanup_queue() from proceeding, or
otherwise fail new mapping requests if the request_queue is being torn
down.

This also introduces blk_dax_ctl to simplify the interface from
fs/dax.c through dax_map_atomic() to bdev_direct_access().

Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk.h            |    2 
 fs/block_dev.c         |   13 +--
 fs/dax.c               |  207 ++++++++++++++++++++++++++++++------------------
 include/linux/blkdev.h |   19 ++++
 4 files changed, 151 insertions(+), 90 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index da722eb786df..c43926d3d74d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
-void blk_queue_exit(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e2cc681d4afe..228c313c0ac1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,10 +436,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
 /**
  * bdev_direct_access() - Get the address for directly-accessibly memory
  * @bdev: The device containing the memory
- * @sector: The offset within the device
- * @addr: Where to put the address of the memory
- * @pfn: The Page Frame Number for the memory
- * @size: The number of bytes requested
+ * @dax: control and output parameters for ->direct_access
  *
  * If a block device is made up of directly addressable memory, this function
  * will tell the caller the PFN and the address of the memory. The address
@@ -450,10 +447,10 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * Return: negative errno if an error occurs, otherwise the number of bytes
  * accessible at this address.
  */
-long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, unsigned long *pfn, long size)
+long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
-	long avail;
+	sector_t sector = dax->sector;
+	long avail, size = dax->size;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
 	/*
@@ -472,7 +469,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	avail = ops->direct_access(bdev, sector, addr, pfn);
+	avail = ops->direct_access(bdev, sector, &dax->addr, &dax->pfn);
 	if (!avail)
 		return -ERANGE;
 	if (avail > 0 && avail & ~PAGE_MASK)
diff --git a/fs/dax.c b/fs/dax.c
index 6e498c2570bf..a6e29d5ad6fd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,45 +30,65 @@
 #include <linux/vmstat.h>
 #include <linux/sizes.h>
 
+static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
+{
+	struct request_queue *q = bdev->bd_queue;
+	long rc = -EIO;
+
+	dax->addr = (void __pmem *) ERR_PTR(-EIO);
+	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
+		return rc;
+
+	rc = bdev_direct_access(bdev, dax);
+	if (rc < 0) {
+		dax->addr = (void __pmem *) ERR_PTR(rc);
+		blk_queue_exit(q);
+		return rc;
+	}
+	return rc;
+}
+
+static void dax_unmap_atomic(struct block_device *bdev,
+		const struct blk_dax_ctl *dax)
+{
+	if (IS_ERR(dax->addr))
+		return;
+	blk_queue_exit(bdev->bd_queue);
+}
+
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
  * and hence this means the stack from this point must follow GFP_NOFS
  * semantics for all operations.
  */
-int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
-	sector_t sector = block << (inode->i_blkbits - 9);
+	struct blk_dax_ctl dax = {
+		.sector = block << (inode->i_blkbits - 9),
+		.size = _size,
+	};
 
 	might_sleep();
 	do {
-		void __pmem *addr;
-		unsigned long pfn;
 		long count, sz;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		count = dax_map_atomic(bdev, &dax);
 		if (count < 0)
 			return count;
 		sz = min_t(long, count, SZ_1M);
-		clear_pmem(addr, sz);
-		size -= sz;
-		sector += sz / 512;
+		clear_pmem(dax.addr, sz);
+		dax.size -= sz;
+		dax.sector += sz / 512;
+		dax_unmap_atomic(bdev, &dax);
 		cond_resched();
-	} while (size);
+	} while (dax.size);
 
 	wmb_pmem();
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
-		unsigned blkbits)
-{
-	unsigned long pfn;
-	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
-}
-
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
 static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
 		loff_t pos, loff_t end)
@@ -98,19 +118,29 @@ static bool buffer_size_valid(struct buffer_head *bh)
 	return bh->b_state != 0;
 }
 
+
+static sector_t to_sector(const struct buffer_head *bh,
+		const struct inode *inode)
+{
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+
+	return sector;
+}
+
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		loff_t start, loff_t end, get_block_t get_block,
 		struct buffer_head *bh)
 {
-	ssize_t retval = 0;
-	loff_t pos = start;
-	loff_t max = start;
-	loff_t bh_max = start;
-	void __pmem *addr;
-	bool hole = false;
-	bool need_wmb = false;
-
-	if (iov_iter_rw(iter) != WRITE)
+	loff_t pos = start, max = start, bh_max = start;
+	bool hole = false, need_wmb = false;
+	struct block_device *bdev = NULL;
+	int rw = iov_iter_rw(iter), rc;
+	long map_len = 0;
+	struct blk_dax_ctl dax = {
+		.addr = (void __pmem *) ERR_PTR(-EIO),
+	};
+
+	if (rw == READ)
 		end = min(end, i_size_read(inode));
 
 	while (pos < end) {
@@ -125,13 +155,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		if (pos == bh_max) {
 			bh->b_size = PAGE_ALIGN(end - pos);
 			bh->b_state = 0;
-			retval = get_block(inode, block, bh,
-					   iov_iter_rw(iter) == WRITE);
-			if (retval)
+			rc = get_block(inode, block, bh, rw == WRITE);
+			if (rc)
 				break;
 			if (!buffer_size_valid(bh))
 				bh->b_size = 1 << blkbits;
 			bh_max = pos - first + bh->b_size;
+			bdev = bh->b_bdev;
 		} else {
 			unsigned done = bh->b_size -
 					(bh_max - (pos - first));
@@ -139,47 +169,52 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 			bh->b_size -= done;
 		}
 
-		hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
+		hole = rw == READ && !buffer_written(bh);
 		if (hole) {
-			addr = NULL;
 			size = bh->b_size - first;
 		} else {
-			retval = dax_get_addr(bh, &addr, blkbits);
-			if (retval < 0)
+			dax_unmap_atomic(bdev, &dax);
+			dax.sector = to_sector(bh, inode);
+			dax.size = bh->b_size;
+			map_len = dax_map_atomic(bdev, &dax);
+			if (map_len < 0) {
+				rc = map_len;
 				break;
+			}
 			if (buffer_unwritten(bh) || buffer_new(bh)) {
-				dax_new_buf(addr, retval, first, pos,
-							end);
+				dax_new_buf(dax.addr, map_len, first,
+						pos, end);
 				need_wmb = true;
 			}
-			addr += first;
-			size = retval - first;
+			dax.addr += first;
+			size = map_len - first;
 		}
 		max = min(pos + size, end);
 
 		if (iov_iter_rw(iter) == WRITE) {
-			len = copy_from_iter_pmem(addr, max - pos, iter);
+			len = copy_from_iter_pmem(dax.addr, max - pos, iter);
 			need_wmb = true;
 		} else if (!hole)
-			len = copy_to_iter((void __force *)addr, max - pos,
+			len = copy_to_iter((void __force *) dax.addr, max - pos,
 					iter);
 		else
 			len = iov_iter_zero(max - pos, iter);
 
 		if (!len) {
-			retval = -EFAULT;
+			rc = -EFAULT;
 			break;
 		}
 
 		pos += len;
-		addr += len;
+		dax.addr += len;
 	}
 
 	if (need_wmb)
 		wmb_pmem();
+	dax_unmap_atomic(bdev, &dax);
 
-	return (pos == start) ? retval : pos - start;
+	return (pos == start) ? rc : pos - start;
 }
 
 /**
@@ -268,28 +303,35 @@ static int dax_load_hole(struct address_space *mapping, struct page *page,
 	return VM_FAULT_LOCKED;
 }
 
-static int copy_user_bh(struct page *to, struct buffer_head *bh,
-			unsigned blkbits, unsigned long vaddr)
+static int copy_user_bh(struct page *to, struct inode *inode,
+		struct buffer_head *bh, unsigned long vaddr)
 {
-	void __pmem *vfrom;
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+	};
+	struct block_device *bdev = bh->b_bdev;
 	void *vto;
 
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+	if (dax_map_atomic(bdev, &dax) < 0)
+		return PTR_ERR(dax.addr);
 	vto = kmap_atomic(to);
-	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
+	copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_atomic(bdev, &dax);
 	return 0;
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct address_space *mapping = inode->i_mapping;
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void __pmem *addr;
-	unsigned long pfn;
+	struct address_space *mapping = inode->i_mapping;
+	struct block_device *bdev = bh->b_bdev;
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+	};
 	pgoff_t size;
 	int error;
 
@@ -308,20 +350,18 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	if (dax_map_atomic(bdev, &dax) < 0) {
+		error = PTR_ERR(dax.addr);
 		goto out;
 	}
 
 	if (buffer_unwritten(bh) || buffer_new(bh)) {
-		clear_pmem(addr, PAGE_SIZE);
+		clear_pmem(dax.addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+	dax_unmap_atomic(bdev, &dax);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -415,7 +455,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
 		if (buffer_written(&bh))
-			error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+			error = copy_user_bh(new_page, inode, &bh, vaddr);
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
@@ -527,11 +567,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned blkbits = inode->i_blkbits;
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
-	long length;
-	void __pmem *kaddr;
+	struct block_device *bdev;
 	pgoff_t size, pgoff;
-	sector_t block, sector;
-	unsigned long pfn;
+	sector_t block;
 	int result = 0;
 
 	/* dax pmd mappings are broken wrt gup and fork */
@@ -559,9 +597,9 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	length = get_block(inode, block, &bh, write);
-	if (length)
+	if (get_block(inode, block, &bh, write) != 0)
 		return VM_FAULT_SIGBUS;
+	bdev = bh.b_bdev;
 	i_mmap_lock_read(mapping);
 
 	/*
@@ -616,32 +654,40 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		sector = bh.b_blocknr << (blkbits - 9);
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = PMD_SIZE,
+		};
+		long length = dax_map_atomic(bdev, &dax);
+
 		if (length < 0) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+		if ((length < PMD_SIZE) || (dax.pfn & PG_PMD_COLOUR)) {
+			dax_unmap_atomic(bdev, &dax);
 			goto fallback;
+		}
 
 		/*
 		 * TODO: teach vmf_insert_pfn_pmd() to support
 		 * 'pte_special' for pmds
 		 */
-		if (pfn_valid(pfn))
+		if (pfn_valid(dax.pfn)) {
+			dax_unmap_atomic(bdev, &dax);
 			goto fallback;
+		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			clear_pmem(kaddr, PMD_SIZE);
+			clear_pmem(dax.addr, PMD_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
+		dax_unmap_atomic(bdev, &dax);
 
-		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+		result |= vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
 	}
 
  out:
@@ -743,12 +789,17 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	if (err < 0)
 		return err;
 	if (buffer_written(&bh)) {
-		void __pmem *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
-		clear_pmem(addr + offset, length);
+		struct block_device *bdev = bh.b_bdev;
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = PAGE_CACHE_SIZE,
+		};
+
+		if (dax_map_atomic(bdev, &dax) < 0)
+			return PTR_ERR(dax.addr);
+		clear_pmem(dax.addr + offset, length);
 		wmb_pmem();
+		dax_unmap_atomic(bdev, &dax);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3fe27f8d91f0..8aa53454ce27 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -794,6 +794,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
@@ -1615,6 +1617,20 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+/**
+ * struct blk_dax_ctl - control and output parameters for ->direct_access
+ * @sector: (input) offset relative to a block_device
+ * @addr: (output) kernel virtual address for @sector populated by driver
+ * @pfn: (output) page frame number for @addr populated by driver
+ * @size: (input) number of bytes requested
+ */
+struct blk_dax_ctl {
+	sector_t sector;
+	void __pmem *addr;
+	long size;
+	unsigned long pfn;
+};
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
@@ -1641,8 +1657,7 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
-extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, unsigned long *pfn, long size);
+extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
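The calling convention the patch establishes is worth stating apart
from the diff: every use of a dax address is bracketed by an enter /
exit of the request_queue reference, so the driver cannot complete
blk_cleanup_queue() while the address is live. A minimal consumer,
sketched under the assumption that it lives in fs/dax.c next to the
helpers above; dax_zero_one_page() is a made-up example and the
clear_pmem() payload is arbitrary:

/*
 * Sketch: resolve the addr/pfn for one page of a block device and
 * keep the driver pinned only while the address is actually used.
 */
static int dax_zero_one_page(struct block_device *bdev, sector_t sector)
{
	struct blk_dax_ctl dax = {
		.sector = sector,
		.size = PAGE_SIZE,
	};

	if (dax_map_atomic(bdev, &dax) < 0)
		return PTR_ERR(dax.addr);  /* queue dying, or no mapping */
	clear_pmem(dax.addr, PAGE_SIZE);   /* safe: driver held off */
	wmb_pmem();
	dax_unmap_atomic(bdev, &dax);      /* drop the queue reference */
	return 0;
}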
end of thread, other threads:[~2015-11-20 17:08 UTC | newest]

Thread overview: 20+ messages
2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams
2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams
2015-11-17 20:51   ` Ross Zwisler
2015-11-17 20:16 ` [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) Dan Williams
2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams
2015-11-18 15:09   ` Jan Kara
2015-11-19  0:22     ` Williams, Dan J
2015-11-19 12:55       ` Jan Kara
2015-11-19 16:55         ` Dan Williams
2015-11-19 17:12           ` Jan Kara
2015-11-19 23:17             ` Dave Chinner
2015-11-20  0:05               ` Williams, Dan J
2015-11-20  4:06                 ` Dave Chinner
2015-11-20  4:25                   ` Dan Williams
2015-11-20 17:08                     ` Dan Williams
2015-11-17 20:16 ` [PATCH 5/8] pmem, dax: clean up clear_pmem() Dan Williams
2015-11-17 20:16 ` [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations Dan Williams
2015-11-17 20:16 ` [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
2015-11-17 20:16 ` [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams