From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 00/11] DAX fsynx/msync support Date: Thu, 29 Oct 2015 14:12:04 -0600 Message-ID: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org This patch series adds support for fsync/msync to DAX. Patches 1 through 8 add various utilities that the DAX code will eventually need, and the DAX code itself is added by patch 9. Patches 10 and 11 are filesystem changes that are needed after the DAX code is added, but these patches may change slightly as the filesystem fault handling for DAX is being modified ([1] and [2]). I've marked this series as RFC because I'm still testing, but I wanted to get this out there so people would see the direction I was going and hopefully comment on any big red flags sooner rather than later. I realize that we are getting pretty dang close to the v4.4 merge window, but I think that if we can get this reviewed and working it's a much better solution than the "big hammer" approach that blindly flushes entire PMEM namespaces [3]. [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html Ross Zwisler (11): pmem: add wb_cache_pmem() to the PMEM API mm: add pmd_mkclean() pmem: enable REQ_FLUSH handling dax: support dirty DAX entries in radix tree mm: add follow_pte_pmd() mm: add pgoff_mkclean() mm: add find_get_entries_tag() fs: add get_block() to struct inode_operations dax: add support for fsync/sync xfs, ext2: call dax_pfn_mkwrite() on write fault ext4: add ext4_dax_pfn_mkwrite() arch/x86/include/asm/pgtable.h | 5 ++ arch/x86/include/asm/pmem.h | 11 +-- drivers/nvdimm/pmem.c | 3 +- fs/dax.c | 161 +++++++++++++++++++++++++++++++++++++++-- fs/ext2/file.c | 5 +- fs/ext4/file.c | 23 +++++- fs/inode.c | 1 + fs/xfs/xfs_file.c | 9 ++- fs/xfs/xfs_iops.c | 1 + include/linux/dax.h | 6 ++ include/linux/fs.h | 5 +- include/linux/mm.h | 2 + include/linux/pagemap.h | 3 + include/linux/pmem.h | 22 +++++- include/linux/radix-tree.h | 3 + include/linux/rmap.h | 5 ++ mm/filemap.c | 73 ++++++++++++++++++- mm/huge_memory.c | 14 ++-- mm/memory.c | 41 +++++++++-- mm/page-writeback.c | 9 +++ mm/rmap.c | 53 ++++++++++++++ mm/truncate.c | 5 +- 22 files changed, 418 insertions(+), 42 deletions(-) -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API Date: Thu, 29 Oct 2015 14:12:05 -0600 Message-ID: <1446149535-16200-2-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org The function __arch_wb_cache_pmem() was already an internal implementation detail of the x86 PMEM API, but this functionality needs to be exported as part of the general PMEM API to handle the fsync/msync case for DAX mmaps. Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pmem.h | 11 ++++++----- include/linux/pmem.h | 22 +++++++++++++++++++++- 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h index d8ce3ec..6c7ade0 100644 --- a/arch/x86/include/asm/pmem.h +++ b/arch/x86/include/asm/pmem.h @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void) } /** - * __arch_wb_cache_pmem - write back a cache range with CLWB + * arch_wb_cache_pmem - write back a cache range with CLWB * @vaddr: virtual start address * @size: number of bytes to write back * * Write back a cache range using the CLWB (cache line write back) * instruction. This function requires explicit ordering with an - * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation. + * arch_wmb_pmem() call. */ -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size) +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) { u16 x86_clflush_size = boot_cpu_data.x86_clflush_size; unsigned long clflush_mask = x86_clflush_size - 1; + void *vaddr = (void __force *)addr; void *vend = vaddr + size; void *p; @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes, len = copy_from_iter_nocache(vaddr, bytes, i); if (__iter_needs_pmem_wb(i)) - __arch_wb_cache_pmem(vaddr, bytes); + arch_wb_cache_pmem(addr, bytes); return len; } @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) else memset(vaddr, 0, size); - __arch_wb_cache_pmem(vaddr, size); + arch_wb_cache_pmem(addr, size); } static inline bool __arch_has_wmb_pmem(void) diff --git a/include/linux/pmem.h b/include/linux/pmem.h index 85f810b3..2cd5003 100644 --- a/include/linux/pmem.h +++ b/include/linux/pmem.h @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) { BUG(); } + +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) +{ + BUG(); +} #endif /* * Architectures that define ARCH_HAS_PMEM_API must provide * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(), - * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem(). + * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem() + * and arch_has_wmb_pmem(). */ static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size) { @@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size) else default_clear_pmem(addr, size); } + +/** + * wb_cache_pmem - write back processor cache for PMEM memory range + * @addr: virtual start address + * @size: number of bytes to write back + * + * Write back the processor cache range starting at 'addr' for 'size' bytes. + * This function requires explicit ordering with a wmb_pmem() call. + */ +static inline void wb_cache_pmem(void __pmem *addr, size_t size) +{ + if (arch_has_pmem_api()) + arch_wb_cache_pmem(addr, size); +} #endif /* __PMEM_H__ */ -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 05/11] mm: add follow_pte_pmd() Date: Thu, 29 Oct 2015 14:12:09 -0600 Message-ID: <1446149535-16200-6-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a huge page PMD leaf to be found and returned. Signed-off-by: Ross Zwisler --- include/linux/mm.h | 2 ++ mm/memory.c | 41 +++++++++++++++++++++++++++++++++-------- 2 files changed, 35 insertions(+), 8 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 80001de..393441c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1166,6 +1166,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); +int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index deb679c..c2b8c0a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3512,8 +3512,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) } #endif /* __PAGETABLE_PMD_FOLDED */ -static int __follow_pte(struct mm_struct *mm, unsigned long address, - pte_t **ptepp, spinlock_t **ptlp) +static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { pgd_t *pgd; pud_t *pud; @@ -3529,12 +3529,23 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address, goto out; pmd = pmd_offset(pud, address); - VM_BUG_ON(pmd_trans_huge(*pmd)); - if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) - goto out; - /* We cannot handle huge page PFN maps. Luckily they don't exist. */ - if (pmd_huge(*pmd)) + if (pmd_huge(*pmd)) { + if (!pmdpp) + goto out; + + *ptlp = pmd_lock(mm, pmd); + if (pmd_huge(*pmd)) { + /* Success, we found a large PTE */ + *pmdpp = pmd; + return 0; + } + /* Somebody removed the PMD entry, try it as a pte */ + spin_unlock(*ptlp); + } + + /* FIXME: pmd_bad() is sometimes set for DAX pmds? */ + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) goto out; ptep = pte_offset_map_lock(mm, pmd, address, ptlp); @@ -3557,9 +3568,23 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte(mm, address, ptepp, ptlp))); + !(res = __follow_pte_pmd(mm, address, ptepp, NULL, + ptlp))); + return res; +} + +int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) +{ + int res; + + /* (void) is needed to make gcc happy */ + (void) __cond_lock(*ptlp, + !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp, + ptlp))); return res; } +EXPORT_SYMBOL(follow_pte_pmd); /** * follow_pfn - look up PFN at a user virtual address -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 03/11] pmem: enable REQ_FLUSH handling Date: Thu, 29 Oct 2015 14:12:07 -0600 Message-ID: <1446149535-16200-4-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Currently the PMEM driver doesn't accept REQ_FLUSH bios. These are sent down via blkdev_issue_flush() in response to a fsync() or msync(). When we get an msync() or fsync() it is the responsibility of the DAX code to flush all dirty pages to media. The PMEM driver then just has issue a wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all the flushed data has been durably stored on the media. Signed-off-by: Ross Zwisler --- drivers/nvdimm/pmem.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 0ba6a97..e1e222e 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -80,7 +80,7 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio) if (do_acct) nd_iostat_end(bio, start); - if (bio_data_dir(bio)) + if (bio_data_dir(bio) || (bio->bi_rw & REQ_FLUSH)) wmb_pmem(); bio_endio(bio); @@ -189,6 +189,7 @@ static int pmem_attach_disk(struct device *dev, blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE); blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX); blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); + blk_queue_flush(pmem->pmem_queue, REQ_FLUSH); queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue); disk = alloc_disk(0); -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 02/11] mm: add pmd_mkclean() Date: Thu, 29 Oct 2015 14:12:06 -0600 Message-ID: <1446149535-16200-3-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Currently PMD pages can be dirtied via pmd_mkdirty(), but cannot be cleaned. For DAX mmap dirty page tracking we need to be able to clean PMD pages when we flush them to media so that we get a new write fault the next time the are written to. Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pgtable.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 867da5b..c548e4c 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -277,6 +277,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } +static inline pmd_t pmd_mkclean(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); +} + static inline pmd_t pmd_mkhuge(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_PSE); -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 04/11] dax: support dirty DAX entries in radix tree Date: Thu, 29 Oct 2015 14:12:08 -0600 Message-ID: <1446149535-16200-5-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Add support for tracking dirty DAX entries in the struct address_space radix tree. This tree is already used for dirty page writeback, and it already supports the use of exceptional (non struct page*) entries. In order to properly track dirty DAX pages we will insert new exceptional entries into the radix tree that represent dirty DAX PTE or PMD pages. There are currently two types of exceptional entries (shmem and shadow) that can be placed into the radix tree, and this adds a third. There shouldn't be any collisions between these various exceptional entries because only one type of exceptional entry should be able to be found in a radix tree at a time depending on how it is being used. Signed-off-by: Ross Zwisler --- fs/inode.c | 1 + include/linux/dax.h | 5 +++++ include/linux/fs.h | 1 + include/linux/radix-tree.h | 3 +++ mm/filemap.c | 12 ++++++++---- mm/truncate.c | 5 +++-- 6 files changed, 21 insertions(+), 6 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 78a17b8..f7c87a6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) spin_lock_irq(&inode->i_data.tree_lock); BUG_ON(inode->i_data.nrpages); BUG_ON(inode->i_data.nrshadows); + BUG_ON(inode->i_data.nrdax); spin_unlock_irq(&inode->i_data.tree_lock); BUG_ON(!list_empty(&inode->i_data.private_list)); BUG_ON(!(inode->i_state & I_FREEING)); diff --git a/include/linux/dax.h b/include/linux/dax.h index b415e52..e9d57f68 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) { return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); } + +static inline bool dax_mapping(struct address_space *mapping) +{ + return mapping->host && IS_DAX(mapping->host); +} #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index 72d8a84..f791698 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -433,6 +433,7 @@ struct address_space { /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ unsigned long nrshadows; /* number of shadow entries */ + unsigned long nrdax; /* number of DAX entries */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ unsigned long flags; /* error bits/gfp mask */ diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 33170db..fabec66 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -51,6 +51,9 @@ #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 +#define RADIX_TREE_DAX_PTE ((void *)(0x10 | RADIX_TREE_EXCEPTIONAL_ENTRY)) +#define RADIX_TREE_DAX_PMD ((void *)(0x20 | RADIX_TREE_EXCEPTIONAL_ENTRY)) + static inline int radix_tree_is_indirect_ptr(void *ptr) { return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); diff --git a/mm/filemap.c b/mm/filemap.c index 327910c..c3a9e4f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -11,6 +11,7 @@ */ #include #include +#include #include #include #include @@ -440,7 +441,7 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (mapping->nrpages) { + if (mapping->nrpages || mapping->nrdax) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); /* See comment of filemap_write_and_wait() */ @@ -538,6 +539,9 @@ static int page_cache_tree_insert(struct address_space *mapping, p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); if (!radix_tree_exceptional_entry(p)) return -EEXIST; + + BUG_ON(dax_mapping(mapping)); + if (shadowp) *shadowp = p; mapping->nrshadows--; @@ -1201,9 +1205,9 @@ repeat: if (radix_tree_deref_retry(page)) goto restart; /* - * A shadow entry of a recently evicted page, - * or a swap entry from shmem/tmpfs. Return - * it without attempting to raise page count. + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. */ goto export; } diff --git a/mm/truncate.c b/mm/truncate.c index 76e35ad..cdf44a0 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -9,6 +9,7 @@ #include #include +#include #include #include #include @@ -29,8 +30,8 @@ static void clear_exceptional_entry(struct address_space *mapping, struct radix_tree_node *node; void **slot; - /* Handled by shmem itself */ - if (shmem_mapping(mapping)) + /* Handled by shmem or DAX directly */ + if (shmem_mapping(mapping) || dax_mapping(mapping)) return; spin_lock_irq(&mapping->tree_lock); -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 09/11] dax: add support for fsync/sync Date: Thu, 29 Oct 2015 14:12:13 -0600 Message-ID: <1446149535-16200-10-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org To properly handle fsync/msync in an efficient way DAX needs to track dirty pages so it is able to flush them durably to media on demand. The tracking of dirty pages is done via the radix tree in struct address_space. This radix tree is already used by the page writeback infrastructure for tracking dirty pages associated with an open file, and it already has support for exceptional (non struct page*) entries. We build upon these features to add exceptional entries to the radix tree for DAX dirty PMD or PTE pages at fault time. When called as part of the msync/fsync flush path DAX queries the radix tree for dirty entries, flushing them and then marking the PTE or PMD page table entries as clean. The step of cleaning the PTE or PMD entries is necessary so that on subsequent writes to the same page we get a new write fault, allowing us to re-enter the DAX tag into the radix tree. Signed-off-by: Ross Zwisler --- fs/dax.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/dax.h | 1 + mm/huge_memory.c | 14 ++--- mm/page-writeback.c | 9 +++ 4 files changed, 172 insertions(+), 13 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 131fd35a..3b38aff 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -24,10 +24,13 @@ #include #include #include +#include #include +#include #include #include #include +#include /* * dax_clear_blocks() is called from within transaction context from XFS, @@ -287,6 +290,42 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, return 0; } +static int dax_dirty_pgoff(struct vm_area_struct *vma, + struct address_space *mapping, unsigned long pgoff, + bool pmd_entry) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + void *tag; + int error = 0; + + __mark_inode_dirty(file_inode(vma->vm_file), I_DIRTY_PAGES); + + spin_lock_irq(&mapping->tree_lock); + + tag = radix_tree_lookup(page_tree, pgoff); + if (tag) { + if (pmd_entry && tag == RADIX_TREE_DAX_PTE) { + radix_tree_delete(&mapping->page_tree, pgoff); + mapping->nrdax--; + } else + goto out; + } + + if (pmd_entry) + error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PMD); + else + error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PTE); + + if (error) + goto out; + + mapping->nrdax++; + radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY); + out: + spin_unlock_irq(&mapping->tree_lock); + return error; +} + static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -450,6 +489,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, delete_from_page_cache(page); unlock_page(page); page_cache_release(page); + page = NULL; } /* @@ -463,6 +503,13 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, * as for normal BH based IO completions. */ error = dax_insert_mapping(inode, &bh, vma, vmf); + if (error) + goto out; + + error = dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false); + if (error) + goto out; + if (buffer_unwritten(&bh)) { if (complete_unwritten) complete_unwritten(&bh, !error); @@ -537,7 +584,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, pgoff_t size, pgoff; sector_t block, sector; unsigned long pfn; - int result = 0; + int error, result = 0; /* Fall back to PTEs if we're going to COW */ if (write && !(vma->vm_flags & VM_SHARED)) @@ -638,6 +685,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, } result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); + + error = dax_dirty_pgoff(vma, inode->i_mapping, pgoff, true); + if (error) + goto fallback; } out: @@ -693,11 +744,11 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault); */ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) { - struct super_block *sb = file_inode(vma->vm_file)->i_sb; + struct file *file = vma->vm_file; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; - sb_start_pagefault(sb); - file_update_time(vma->vm_file); - sb_end_pagefault(sb); + dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false); return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); @@ -772,3 +823,103 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) return dax_zero_page_range(inode, from, length, get_block); } EXPORT_SYMBOL_GPL(dax_truncate_page); + +static int dax_flush_one_mapping(struct address_space *mapping, + struct inode *inode, sector_t block, void *tag) +{ + get_block_t *get_block = inode->i_op->get_block; + struct buffer_head bh; + void __pmem *addr; + int ret; + + BUG_ON(tag != RADIX_TREE_DAX_PMD && tag != RADIX_TREE_DAX_PTE); + + memset(&bh, 0, sizeof(bh)); + + if (tag == RADIX_TREE_DAX_PMD) + bh.b_size = PMD_SIZE; + else + bh.b_size = PAGE_SIZE; + + ret = get_block(inode, block, &bh, false); + BUG_ON(!buffer_written(&bh)); + if (ret < 0) + return ret; + + ret = dax_get_addr(&bh, &addr, inode->i_blkbits); + if (ret < 0) + return ret; + + if (tag == RADIX_TREE_DAX_PMD) + WARN_ON(ret != PMD_SIZE); + else + WARN_ON(ret != PAGE_SIZE); + + wb_cache_pmem(addr, ret); + + spin_lock_irq(&mapping->tree_lock); + radix_tree_delete(&mapping->page_tree, block); + spin_unlock_irq(&mapping->tree_lock); + mapping->nrdax--; + + return pgoff_mkclean(block, mapping); +} + +/* + * flush the mapping to the persistent domain within the byte range of (start, + * end). This is required by data integrity operations to ensure file data is on + * persistent storage prior to completion of the operation. It also requires us + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when + * the file is written to again so wehave an indication that we need to flush + * the mapping if a data integrity operation takes place. + * + * We don't need commits to storage here - the filesystems will issue flushes + * appropriately at the conclusion of the data integrity operation via REQ_FUA + * writes or blkdev_issue_flush() commands. This requires the DAX block device + * to implement persistent storage domain fencing/commits on receiving a + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher + * layers. + */ +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end) +{ + struct inode *inode = mapping->host; + pgoff_t indices[PAGEVEC_SIZE]; + struct pagevec pvec; + int i, error; + + pgoff_t start_page = start >> PAGE_CACHE_SHIFT; + pgoff_t end_page = end >> PAGE_CACHE_SHIFT; + + + if (mapping->nrdax == 0) + return 0; + + if (!inode->i_op->get_block) { + WARN_ONCE(1, "Flushing DAX mapping without get_block()!"); + mapping->nrdax = 0; + return 0; + } + + BUG_ON(inode->i_blkbits != PAGE_SHIFT); + + tag_pages_for_writeback(mapping, start_page, end_page); + + pagevec_init(&pvec, 0); + while (1) { + pvec.nr = find_get_entries_tag(mapping, start_page, + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, + pvec.pages, indices); + + if (pvec.nr == 0) + break; + + for (i = 0; i < pvec.nr; i++) { + error = dax_flush_one_mapping(mapping, inode, + indices[i], pvec.pages[i]); + if (error) + return error; + } + } + + return 0; +} diff --git a/include/linux/dax.h b/include/linux/dax.h index e9d57f68..5eff476 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -41,4 +41,5 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end); #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bbac913..1b3df56 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, spinlock_t *ptl; ptl = pmd_lock(mm, pmd); - if (pmd_none(*pmd)) { - entry = pmd_mkhuge(pfn_pmd(pfn, prot)); - if (write) { - entry = pmd_mkyoung(pmd_mkdirty(entry)); - entry = maybe_pmd_mkwrite(entry, vma); - } - set_pmd_at(mm, addr, pmd, entry); - update_mmu_cache_pmd(vma, addr, pmd); + entry = pmd_mkhuge(pfn_pmd(pfn, prot)); + if (write) { + entry = pmd_mkyoung(pmd_mkdirty(entry)); + entry = maybe_pmd_mkwrite(entry, vma); } + set_pmd_at(mm, addr, pmd, entry); + update_mmu_cache_pmd(vma, addr, pmd); spin_unlock(ptl); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2c90357..1801df8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include "internal.h" @@ -2338,6 +2339,14 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) if (wbc->nr_to_write <= 0) return 0; + + if (wbc->sync_mode == WB_SYNC_ALL && dax_mapping(mapping)) { + ret = dax_flush_mapping(mapping, wbc->range_start, + wbc->range_end); + if (ret) + return ret; + } + if (mapping->a_ops->writepages) ret = mapping->a_ops->writepages(mapping, wbc); else -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 08/11] fs: add get_block() to struct inode_operations Date: Thu, 29 Oct 2015 14:12:12 -0600 Message-ID: <1446149535-16200-9-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org To be able to flush dirty pages to media as part of the fsync/msync path DAX needs to be able to map file offsets to kernel addresses via a combination of the filesystem's get_block() routine and bdev_direct_access(). This currently happens in the DAX fault handlers which receive a get_block() callback directly from the filesystem via a function parameter. For the fsync/msync path this doesn't work, though, because DAX is called not by the filesystem but by the writeback infrastructure which doesn't know about the filesystem specific get_block() routine. To handle this we make get_block() an entry in the struct inode_operations table so that we can access the correct get_block() routine in the context of the writeback infrastructure. Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 1 + fs/ext4/file.c | 1 + fs/xfs/xfs_iops.c | 1 + include/linux/fs.h | 4 ++-- 4 files changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 11a42c5..fc1418c 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -202,4 +202,5 @@ const struct inode_operations ext2_file_inode_operations = { .get_acl = ext2_get_acl, .set_acl = ext2_set_acl, .fiemap = ext2_fiemap, + .get_block = ext2_get_block, }; diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 113837e..54d7729 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -720,5 +720,6 @@ const struct inode_operations ext4_file_inode_operations = { .get_acl = ext4_get_acl, .set_acl = ext4_set_acl, .fiemap = ext4_fiemap, + .get_block = ext4_get_block_dax, }; diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 8294132..c58c270 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1112,6 +1112,7 @@ static const struct inode_operations xfs_inode_operations = { .listxattr = xfs_vn_listxattr, .fiemap = xfs_vn_fiemap, .update_time = xfs_vn_update_time, + .get_block = xfs_get_blocks_direct, }; static const struct inode_operations xfs_dir_inode_operations = { diff --git a/include/linux/fs.h b/include/linux/fs.h index f791698..1dca85b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1679,8 +1679,8 @@ struct inode_operations { umode_t create_mode, int *opened); int (*tmpfile) (struct inode *, struct dentry *, umode_t); int (*set_acl)(struct inode *, struct posix_acl *, int); - - /* WARNING: probably going away soon, do not use! */ + int (*get_block)(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create); } ____cacheline_aligned; ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 06/11] mm: add pgoff_mkclean() Date: Thu, 29 Oct 2015 14:12:10 -0600 Message-ID: <1446149535-16200-7-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Introduce pgoff_mkclean() which conceptually is similar to page_mkclean() except it works in the absence of struct page and it can also be used to clean PMDs. This is needed for DAX's dirty page handling. Signed-off-by: Ross Zwisler --- include/linux/rmap.h | 5 +++++ mm/rmap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 58 insertions(+) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 29446ae..627875f9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -223,6 +223,11 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); int page_mkclean(struct page *); /* + * Cleans and write protects the PTEs of shared mappings. + */ +int pgoff_mkclean(pgoff_t, struct address_space *); + +/* * called in munlock()/munmap() path to check for other vmas holding * the page mlocked. */ diff --git a/mm/rmap.c b/mm/rmap.c index f5b5c1f..0ce16ab 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -586,6 +586,16 @@ vma_address(struct page *page, struct vm_area_struct *vma) return address; } +static inline unsigned long +pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma) +{ + unsigned long address; + + address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); + return address; +} + #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH static void percpu_flush_tlb_batch_pages(void *data) { @@ -1040,6 +1050,49 @@ int page_mkclean(struct page *page) } EXPORT_SYMBOL_GPL(page_mkclean); +int pgoff_mkclean(pgoff_t pgoff, struct address_space *mapping) +{ + struct vm_area_struct *vma; + int ret = 0; + + i_mmap_lock_read(mapping); + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + struct mm_struct *mm = vma->vm_mm; + pmd_t pmd, *pmdp = NULL; + pte_t pte, *ptep = NULL; + unsigned long address; + spinlock_t *ptl; + + address = pgoff_address(pgoff, vma); + + ret = follow_pte_pmd(mm, address, &ptep, &pmdp, &ptl); + if (ret) + goto out; + + if (pmdp) { + flush_cache_page(vma, address, pmd_pfn(*pmdp)); + pmd = pmdp_huge_clear_flush(vma, address, pmdp); + pmd = pmd_wrprotect(pmd); + pmd = pmd_mkclean(pmd); + set_pmd_at(mm, address, pmdp, pmd); + spin_unlock(ptl); + } else { + BUG_ON(!ptep); + flush_cache_page(vma, address, pte_pfn(*ptep)); + pte = ptep_clear_flush(vma, address, ptep); + pte = pte_wrprotect(pte); + pte = pte_mkclean(pte); + set_pte_at(mm, address, ptep, pte); + pte_unmap_unlock(ptep, ptl); + } + } + + out: + i_mmap_unlock_read(mapping); + return ret; +} +EXPORT_SYMBOL_GPL(pgoff_mkclean); + /** * page_move_anon_rmap - move a page to our anon_vma * @page: the page to move to our anon_vma -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 07/11] mm: add find_get_entries_tag() Date: Thu, 29 Oct 2015 14:12:11 -0600 Message-ID: <1446149535-16200-8-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Add find_get_entries_tag() to the family of functions that include find_get_entries(), find_get_pages() and find_get_pages_tag(). This is needed for DAX dirty page handling because we need a list of both page offsets and radix tree entries ('indices' and 'entries' in this function) that are marked with the PAGECACHE_TAG_TOWRITE tag. Signed-off-by: Ross Zwisler --- include/linux/pagemap.h | 3 +++ mm/filemap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 64 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index a6c78e0..6fea3be 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -354,6 +354,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages); unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, int tag, unsigned int nr_pages, struct page **pages); +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices); struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags); diff --git a/mm/filemap.c b/mm/filemap.c index c3a9e4f..992cf84 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1454,6 +1454,67 @@ repeat: } EXPORT_SYMBOL(find_get_pages_tag); +/** + * find_get_entries_tag - find and return entries that match @tag + * @mapping: the address_space to search + * @start: the starting page cache index + * @tag: the tag index + * @nr_entries: the maximum number of entries + * @entries: where the resulting entries are placed + * @indices: the cache indices corresponding to the entries in @entries + * + * Like find_get_entries, except we only return entries which are tagged with + * @tag. + */ +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices) +{ + void **slot; + unsigned int ret = 0; + struct radix_tree_iter iter; + + if (!nr_entries) + return 0; + + rcu_read_lock(); +restart: + radix_tree_for_each_tagged(slot, &mapping->page_tree, + &iter, start, tag) { + struct page *page; +repeat: + page = radix_tree_deref_slot(slot); + if (unlikely(!page)) + continue; + if (radix_tree_exception(page)) { + if (radix_tree_deref_retry(page)) + goto restart; + /* + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. + */ + goto export; + } + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? */ + if (unlikely(page != *slot)) { + page_cache_release(page); + goto repeat; + } +export: + indices[ret] = iter.index; + entries[ret] = page; + if (++ret == nr_entries) + break; + } + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(find_get_entries_tag); + /* * CD/DVDs are error prone. When a medium error occurs, the driver may fail * a _large_ part of the i/o request. Imagine the worst scenario: -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault Date: Thu, 29 Oct 2015 14:12:14 -0600 Message-ID: <1446149535-16200-11-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org Previously the functionality offered by dax_pfn_mkwrite() was open-coded in XFS and ext2. With the addition of DAX msync/fsync support we need to call dax_pfn_mkwrite() so that the newly writeable page can be added to the radix tree and marked as dirty. Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 4 +++- fs/xfs/xfs_file.c | 9 +++++---- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index fc1418c..5741e29 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct ext2_inode_info *ei = EXT2_I(inode); - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(inode->i_sb); file_update_time(vma->vm_file); @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&ei->dax_sem); sb_end_pagefault(inode->i_sb); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f429662..584e4eb 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1575,9 +1575,8 @@ xfs_filemap_pmd_fault( /* * pfn_mkwrite was originally inteneded to ensure we capture time stamp * updates on write faults. In reality, it's need to serialise against - * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite() - * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault - * barrier in place. + * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED + * to ensure we serialise the fault barrier in place. */ static int xfs_filemap_pfn_mkwrite( @@ -1587,7 +1586,7 @@ xfs_filemap_pfn_mkwrite( struct inode *inode = file_inode(vma->vm_file); struct xfs_inode *ip = XFS_I(inode); - int ret = VM_FAULT_NOPAGE; + int ret; loff_t size; trace_xfs_filemap_pfn_mkwrite(ip); @@ -1600,6 +1599,8 @@ xfs_filemap_pfn_mkwrite( size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); sb_end_pagefault(inode->i_sb); return ret; -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [RFC 11/11] ext4: add ext4_dax_pfn_mkwrite() Date: Thu, 29 Oct 2015 14:12:15 -0600 Message-ID: <1446149535-16200-12-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Add ext4_dax_pfn_mkwrite() that properly protects against freeze, sets the file update time and calls into dax_pfn_mkwrite() to add the newly dirtied page to the radix tree which tracks dirty DAX pages. Signed-off-by: Ross Zwisler --- fs/ext4/file.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 54d7729..7f62cad 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -272,11 +272,31 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) ext4_end_io_unwritten); } +static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct inode *inode = file_inode(vma->vm_file); + loff_t size; + int ret; + + sb_start_pagefault(inode->i_sb); + file_update_time(vma->vm_file); + + /* check that the faulting page hasn't raced with truncate */ + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; + if (vmf->pgoff >= size) + ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); + + sb_end_pagefault(inode->i_sb); + return ret; +} + static const struct vm_operations_struct ext4_dax_vm_ops = { .fault = ext4_dax_fault, .pmd_fault = ext4_dax_pmd_fault, .page_mkwrite = ext4_dax_mkwrite, - .pfn_mkwrite = dax_pfn_mkwrite, + .pfn_mkwrite = ext4_dax_pfn_mkwrite, }; #else #define ext4_dax_vm_ops ext4_file_vm_ops -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Thu, 29 Oct 2015 16:49:53 -0600 Message-ID: <20151029224953.GA17933@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Ross Zwisler , Dave Chinner Return-path: Content-Disposition: inline In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html Hmm...I think I may need to isolate the fsync/msync flushing against races with truncate since we are calling into the filesystem directly with get_block(). Dave (Chinner), does this sound right? Also, one thing I forgot to mention is that these patches are built upon the first version of Dave Chinner's XFS patches and my ext2 patches that deal with the truncate races with DAX. A snapshot of my development tree with these patches applied can be found here: https://github.com/01org/prd/tree/fsync_rfc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Fri, 30 Oct 2015 14:55:33 +1100 Message-ID: <20151030035533.GU19199@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Ross Zwisler Return-path: Content-Disposition: inline In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. We need the "big hammer" regardless of fsync. If REQ_FLUSH and REQ_FUA don't do the right thing when it comes to ordering journal writes against other IO operations, then the filesystems are not crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding changes back to stable storage, just like they do for existing storage.... > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > Ross Zwisler (11): > pmem: add wb_cache_pmem() to the PMEM API > mm: add pmd_mkclean() > pmem: enable REQ_FLUSH handling > dax: support dirty DAX entries in radix tree > mm: add follow_pte_pmd() > mm: add pgoff_mkclean() > mm: add find_get_entries_tag() > fs: add get_block() to struct inode_operations I don't think this is the right thing to do - it propagates the use of bufferheads as a mapping structure into places where we do not want bufferheads. We've recently added a similar block mapping interface to the export operations structure for PNFS and that uses a "struct iomap" which is far more suited to being an inode operation this. We have plans to move this to the inode operations for various reasons. e.g: multipage write, adding interfaces that support proper mapping of holes, etc: https://www.redhat.com/archives/cluster-devel/2014-October/msg00167.html So after many years of saying no to moving getblocks to the inode operations it seems like the wrong thing to do now considering I want to convert all the DAX code to use iomaps while only 2/3 filesystems are supported... > dax: add support for fsync/sync Why put the dax_flush_mapping() in do_writepages()? Why not call it directly from the filesystem ->fsync() implementations where a getblocks callback could also be provided? Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Williams Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Fri, 30 Oct 2015 11:34:07 -0700 Message-ID: References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Cc: "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Ross Zwisler Return-path: In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > Ross Zwisler (11): > pmem: add wb_cache_pmem() to the PMEM API > mm: add pmd_mkclean() > pmem: enable REQ_FLUSH handling > dax: support dirty DAX entries in radix tree > mm: add follow_pte_pmd() > mm: add pgoff_mkclean() > mm: add find_get_entries_tag() > fs: add get_block() to struct inode_operations > dax: add support for fsync/sync > xfs, ext2: call dax_pfn_mkwrite() on write fault > ext4: add ext4_dax_pfn_mkwrite() This is great to have when the flush-the-world solution ends up killing performance. However, there are a couple mitigating options for workloads that dirty small amounts and flush often that we need to collect data on: 1/ Using cache management and pcommit from userspace to skip calls to msync / fsync. Although, this does not eliminate all calls to blkdev_issue_flush as the fs may invoke it for other reasons. I suspect turning on REQ_FUA support eliminates a number of those invocations, and pmem already satisfies REQ_FUA semantics by default. 2/ Turn off DAX and use the page cache. As Dave mentions [1] we should enable this control on a per-inode basis. I'm folding in this capability as a blkdev_ioctl for the next version of the raw block DAX support patch. It's entirely possible these mitigations won't eliminate the need for a mechanism like this, but I think we have a bit more work to do to find out how bad this is in practice as well as the crossover point where walking the radix becomes prohibitive. We also have the option of tracking open DAX extents in the driver. Even at coarse granularities I'd be surprised if we can't mitigate most of the overhead. [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002598.html -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Fri, 30 Oct 2015 12:39:38 -0600 Message-ID: <20151030183938.GC24643@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Dave Chinner Return-path: Content-Disposition: inline In-Reply-To: <20151030035533.GU19199@dastard> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote: > On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > > This patch series adds support for fsync/msync to DAX. > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > filesystem changes that are needed after the DAX code is added, but these > > patches may change slightly as the filesystem fault handling for DAX is > > being modified ([1] and [2]). > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > get this out there so people would see the direction I was going and > > hopefully comment on any big red flags sooner rather than later. > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > but I think that if we can get this reviewed and working it's a much better > > solution than the "big hammer" approach that blindly flushes entire PMEM > > namespaces [3]. > > We need the "big hammer" regardless of fsync. If REQ_FLUSH and > REQ_FUA don't do the right thing when it comes to ordering journal > writes against other IO operations, then the filesystems are not > crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding > changes back to stable storage, just like they do for existing > storage.... I think that what I've got here (when it's fully working) will protect all the cases that we need. AFAIK there are three ways that data can be written to a PMEM namespace: 1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or pmem_rw_bytes(). All of these paths sync the newly written data durably to media before the I/O completes so they shouldn't have any reliance on REQ_FUA/REQ_FLUSH. 2) Through the DAX I/O path, dax_io(). As with PMEM we flush the newly written data durably to media before the I/O operation completes, so this path shouldn't have any reliance on REQ_FUA/REQ_FLUSH. 3) Through mmaps set up by DAX. This is the path we are trying to protect with the dirty page tracking and flushing in this patch set, and I think that this is the only path that has reliance on REQ_FLUSH. The goal of this set is to have the cache writeback all happen as part of the fsync/msync handling, and then have the REQ_FLUSH just provide the trailing wmb_pmem(). My guess is that XFS metadata writes happen via path 1), down through the PMEM driver. Am I missing anything, or should we be good to go? > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > > > Ross Zwisler (11): > > pmem: add wb_cache_pmem() to the PMEM API > > mm: add pmd_mkclean() > > pmem: enable REQ_FLUSH handling > > dax: support dirty DAX entries in radix tree > > mm: add follow_pte_pmd() > > mm: add pgoff_mkclean() > > mm: add find_get_entries_tag() > > fs: add get_block() to struct inode_operations > > I don't think this is the right thing to do - it propagates the use > of bufferheads as a mapping structure into places where we do not > want bufferheads. We've recently added a similar block mapping > interface to the export operations structure for PNFS and that uses > a "struct iomap" which is far more suited to being an inode > operation this. > > We have plans to move this to the inode operations for various > reasons. e.g: multipage write, adding interfaces that support proper > mapping of holes, etc: > > https://www.redhat.com/archives/cluster-devel/2014-October/msg00167.html > > So after many years of saying no to moving getblocks to the inode > operations it seems like the wrong thing to do now considering I > want to convert all the DAX code to use iomaps while only 2/3 > filesystems are supported... Okay, I'll take a look at this interface. I also think that we may need to flow through the filesystem before going into the DAX code so that we can serialize our flushing with respect to extent manipulation, as we had to do with our DAX fault paths. > > dax: add support for fsync/sync > > Why put the dax_flush_mapping() in do_writepages()? Why not call it > directly from the filesystem ->fsync() implementations where a > getblocks callback could also be provided? Because that's where you put it in your example. :) https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html Moving it into the filesystem where we know about get_block() is probably the right thing to do - I'll check it out. Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Fri, 30 Oct 2015 13:43:00 -0600 Message-ID: <20151030194300.GA22670@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Dan Williams Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: > On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler > wrote: > > This patch series adds support for fsync/msync to DAX. > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > filesystem changes that are needed after the DAX code is added, but these > > patches may change slightly as the filesystem fault handling for DAX is > > being modified ([1] and [2]). > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > get this out there so people would see the direction I was going and > > hopefully comment on any big red flags sooner rather than later. > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > but I think that if we can get this reviewed and working it's a much better > > solution than the "big hammer" approach that blindly flushes entire PMEM > > namespaces [3]. > > > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > > > Ross Zwisler (11): > > pmem: add wb_cache_pmem() to the PMEM API > > mm: add pmd_mkclean() > > pmem: enable REQ_FLUSH handling > > dax: support dirty DAX entries in radix tree > > mm: add follow_pte_pmd() > > mm: add pgoff_mkclean() > > mm: add find_get_entries_tag() > > fs: add get_block() to struct inode_operations > > dax: add support for fsync/sync > > xfs, ext2: call dax_pfn_mkwrite() on write fault > > ext4: add ext4_dax_pfn_mkwrite() > > This is great to have when the flush-the-world solution ends up > killing performance. However, there are a couple mitigating options > for workloads that dirty small amounts and flush often that we need to > collect data on: > > 1/ Using cache management and pcommit from userspace to skip calls to > msync / fsync. Although, this does not eliminate all calls to > blkdev_issue_flush as the fs may invoke it for other reasons. I > suspect turning on REQ_FUA support eliminates a number of those > invocations, and pmem already satisfies REQ_FUA semantics by default. Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM already handles the requirements of REQ_FUA, but I didn't realize that it might reduce the number of REQ_FLUSH bios we receive. > 2/ Turn off DAX and use the page cache. As Dave mentions [1] we > should enable this control on a per-inode basis. I'm folding in this > capability as a blkdev_ioctl for the next version of the raw block DAX > support patch. Umm...I think you just said "the way to avoid this delay is to just not use DAX". :) I don't think this is where we want to go - we are trying to make DAX better, not abandon it. > It's entirely possible these mitigations won't eliminate the need for > a mechanism like this, but I think we have a bit more work to do to > find out how bad this is in practice as well as the crossover point > where walking the radix becomes prohibitive. I'm guessing a single run through xfstests will be enough to convince you that the "big hammer" approach is untenable. Tests that used to take a second now take several minutes, at least in my VM testing environment... And that's only using a tiny 4GiB namespace. Yes, we can distribute the cost over multiple CPUs, but that just distributes the problem and doesn't reduce the overall work that needs to be done. Ultimately I think that looping through multiple GiB or even TiB of cache lines and blindly writing them back individually on every REQ_FLUSH is going to be a deal breaker. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Williams Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Fri, 30 Oct 2015 12:51:40 -0700 Message-ID: References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030194300.GA22670@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Return-path: In-Reply-To: <20151030194300.GA22670@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler wrote: > On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: >> On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler >> wrote: >> > This patch series adds support for fsync/msync to DAX. >> > >> > Patches 1 through 8 add various utilities that the DAX code will eventually >> > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are >> > filesystem changes that are needed after the DAX code is added, but these >> > patches may change slightly as the filesystem fault handling for DAX is >> > being modified ([1] and [2]). >> > >> > I've marked this series as RFC because I'm still testing, but I wanted to >> > get this out there so people would see the direction I was going and >> > hopefully comment on any big red flags sooner rather than later. >> > >> > I realize that we are getting pretty dang close to the v4.4 merge window, >> > but I think that if we can get this reviewed and working it's a much better >> > solution than the "big hammer" approach that blindly flushes entire PMEM >> > namespaces [3]. >> > >> > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html >> > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 >> > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html >> > >> > Ross Zwisler (11): >> > pmem: add wb_cache_pmem() to the PMEM API >> > mm: add pmd_mkclean() >> > pmem: enable REQ_FLUSH handling >> > dax: support dirty DAX entries in radix tree >> > mm: add follow_pte_pmd() >> > mm: add pgoff_mkclean() >> > mm: add find_get_entries_tag() >> > fs: add get_block() to struct inode_operations >> > dax: add support for fsync/sync >> > xfs, ext2: call dax_pfn_mkwrite() on write fault >> > ext4: add ext4_dax_pfn_mkwrite() >> >> This is great to have when the flush-the-world solution ends up >> killing performance. However, there are a couple mitigating options >> for workloads that dirty small amounts and flush often that we need to >> collect data on: >> >> 1/ Using cache management and pcommit from userspace to skip calls to >> msync / fsync. Although, this does not eliminate all calls to >> blkdev_issue_flush as the fs may invoke it for other reasons. I >> suspect turning on REQ_FUA support eliminates a number of those >> invocations, and pmem already satisfies REQ_FUA semantics by default. > > Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM > already handles the requirements of REQ_FUA, but I didn't realize that it > might reduce the number of REQ_FLUSH bios we receive. I'll let Dave chime in, but a lot of the flush requirements come from guaranteeing the state of the metadata, if metadata updates can be done with REQ_FUA then there is no subsequent need to flush. >> 2/ Turn off DAX and use the page cache. As Dave mentions [1] we >> should enable this control on a per-inode basis. I'm folding in this >> capability as a blkdev_ioctl for the next version of the raw block DAX >> support patch. > > Umm...I think you just said "the way to avoid this delay is to just not use > DAX". :) I don't think this is where we want to go - we are trying to make > DAX better, not abandon it. That's a bit of an exaggeration. Avoiding DAX where it is not necessary is not "abandoning DAX", it's using the right tool for the job. Page cache is fine for many cases. >> It's entirely possible these mitigations won't eliminate the need for >> a mechanism like this, but I think we have a bit more work to do to >> find out how bad this is in practice as well as the crossover point >> where walking the radix becomes prohibitive. > > I'm guessing a single run through xfstests will be enough to convince you that > the "big hammer" approach is untenable. Tests that used to take a second now > take several minutes, at least in my VM testing environment... And that's > only using a tiny 4GiB namespace. > > Yes, we can distribute the cost over multiple CPUs, but that just distributes > the problem and doesn't reduce the overall work that needs to be done. > Ultimately I think that looping through multiple GiB or even TiB of cache > lines and blindly writing them back individually on every REQ_FLUSH is going > to be a deal breaker. Right, part of the problem is that the driver doesn't know which blocks are actively DAX mapped. I think we can incrementally fix that without requiring DAX specific fsync/msync handling code for each fs that supports DAX. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Mon, 2 Nov 2015 10:29:48 +1100 Message-ID: <20151101232948.GF10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii To: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20151030183938.GC24643@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Fri, Oct 30, 2015 at 12:39:38PM -0600, Ross Zwisler wrote: > On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote: > > On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > > > This patch series adds support for fsync/msync to DAX. > > > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > > filesystem changes that are needed after the DAX code is added, but these > > > patches may change slightly as the filesystem fault handling for DAX is > > > being modified ([1] and [2]). > > > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > > get this out there so people would see the direction I was going and > > > hopefully comment on any big red flags sooner rather than later. > > > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > > but I think that if we can get this reviewed and working it's a much better > > > solution than the "big hammer" approach that blindly flushes entire PMEM > > > namespaces [3]. > > > > We need the "big hammer" regardless of fsync. If REQ_FLUSH and > > REQ_FUA don't do the right thing when it comes to ordering journal > > writes against other IO operations, then the filesystems are not > > crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding > > changes back to stable storage, just like they do for existing > > storage.... > > I think that what I've got here (when it's fully working) will protect all the > cases that we need. > > AFAIK there are three ways that data can be written to a PMEM namespace: > > 1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or > pmem_rw_bytes(). All of these paths sync the newly written data durably to > media before the I/O completes so they shouldn't have any reliance on > REQ_FUA/REQ_FLUSH. I suspect that not all future pmem devices will use this driver/interface/semantics. Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable storage" commands. They are also IO barriers that affect scheduling of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA IO cannot be dispatched before all prior IO has been dispatched and drained from the request queue, and IO submitted after a queued REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued REQ_FLUSH/REQ_FUA operation. IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable storage, they also guarantee the order of IO dispatch and completion when concurrent IO is in progress. > 2) Through the DAX I/O path, dax_io(). As with PMEM we flush the newly > written data durably to media before the I/O operation completes, so this path > shouldn't have any reliance on REQ_FUA/REQ_FLUSH. That's fine, but that's not the problem we need solved ;) > 3) Through mmaps set up by DAX. This is the path we are trying to protect > with the dirty page tracking and flushing in this patch set, and I think that > this is the only path that has reliance on REQ_FLUSH. Quite possibly this is the case for the current intel pmem driver, but I don't look at the functionality from that perspective. Dirty page tracking is needed to enable "data writeback", whether it be CPU cachelines via pcommit() or dirty pages via submit_bio(). How the pages get dirty is irrelevant - the fact is they are dirty and we need to do /something/ to ensure they are correctly written back to the storage layer. REQ_FLUSH is needed to guarantee all data that has been written back to the storage layer is persistent in that layer. How a /driver/ manages that is up to the driver - the actual implementation is irrelevant to the higher layers. i.e. what we are concerned about at the filesystem level is that: a) "data writeback" is started correctly; b) the "data writeback" is completed; and c) volatile caches are completely flushed before we write the metadata changes that reference that data to the journal via FUA e.g. we could have pmem, but we are using buffered IO (i.e. non-DAX) and a hardware driver that doesn't flush CPU cachelines in the physical IO path. This requires that driver to flush CPU cachelines and place memory barriers in REQ_FLUSH operations, as well as after writing the data in REQ_FUA operations. Yes, this is different to the way the intel pmem drivers work (i.e. as noted in 1) above), but it is /not wrong/ as long as REQ_FLUSH/REQ_FUA also flush dirty cpu cachelines. IOWs, the high level code we write that implements fsync for DAX needs to be generic enough so that when something slightly different comes along we don't have to throw everything away and start again. I think your code will end up being generic enough to handle this, but let's make sure we don't implement something that can only work with pmem hardware/drivers that do all IO as fully synchronous to the stable domain... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Mon, 2 Nov 2015 10:36:32 +1100 Message-ID: <20151101233632.GG10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030194300.GA22670@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Dan Williams Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Fri, Oct 30, 2015 at 12:51:40PM -0700, Dan Williams wrote: > On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler > wrote: > > On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: > >> This is great to have when the flush-the-world solution ends up > >> killing performance. However, there are a couple mitigating options > >> for workloads that dirty small amounts and flush often that we need to > >> collect data on: > >> > >> 1/ Using cache management and pcommit from userspace to skip calls to > >> msync / fsync. Although, this does not eliminate all calls to > >> blkdev_issue_flush as the fs may invoke it for other reasons. I > >> suspect turning on REQ_FUA support eliminates a number of those > >> invocations, and pmem already satisfies REQ_FUA semantics by default. > > > > Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM > > already handles the requirements of REQ_FUA, but I didn't realize that it > > might reduce the number of REQ_FLUSH bios we receive. > > I'll let Dave chime in, but a lot of the flush requirements come from > guaranteeing the state of the metadata, if metadata updates can be > done with REQ_FUA then there is no subsequent need to flush. No need for cache flushes in this case, but we still need the IO scheduler to order such operations correctly. > >> 2/ Turn off DAX and use the page cache. As Dave mentions [1] we > >> should enable this control on a per-inode basis. I'm folding in this > >> capability as a blkdev_ioctl for the next version of the raw block DAX > >> support patch. > > > > Umm...I think you just said "the way to avoid this delay is to just not use > > DAX". :) I don't think this is where we want to go - we are trying to make > > DAX better, not abandon it. > > That's a bit of an exaggeration. Avoiding DAX where it is not > necessary is not "abandoning DAX", it's using the right tool for the > job. Page cache is fine for many cases. Think btrfs - any file that uses COW can't use DAX for write. Everything has to be buffered, unless the nodatacow flag is set, and then DAX can be used. Indeed, on ext4 if you are using file encryption you can't use DAX. IOWs, we already know that we have to support mixed DAX/non-DAX access within the same filesystem, so I'm with Dan here... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Mon, 02 Nov 2015 09:22:15 -0500 Message-ID: References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> Mime-Version: 1.0 Content-Type: text/plain Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Dave Chinner Return-path: In-Reply-To: <20151101232948.GF10656@dastard> (Dave Chinner's message of "Mon, 2 Nov 2015 10:29:48 +1100") Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Dave Chinner writes: > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > storage" commands. They are also IO barriers that affect scheduling > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > IO cannot be dispatched before all prior IO has been dispatched and > drained from the request queue, and IO submitted after a queued > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > REQ_FLUSH/REQ_FUA operation. > > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > storage, they also guarantee the order of IO dispatch and > completion when concurrent IO is in progress. This hasn't been the case for several years, now. It used to work that way, and that was deemed a big performance problem. Since file systems already issued and waited for all I/O before sending down a barrier, we decided to get rid of the I/O ordering pieces of barriers (and stop calling them barriers). See commit 28e7d184521 (block: drop barrier ordering by queue draining). Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Tue, 3 Nov 2015 07:10:29 +1100 Message-ID: <20151102201029.GI10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Jeff Moyer Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: > Dave Chinner writes: > > > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > > storage" commands. They are also IO barriers that affect scheduling > > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > > IO cannot be dispatched before all prior IO has been dispatched and > > drained from the request queue, and IO submitted after a queued > > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > > REQ_FLUSH/REQ_FUA operation. > > > > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > > storage, they also guarantee the order of IO dispatch and > > completion when concurrent IO is in progress. > > This hasn't been the case for several years, now. It used to work that > way, and that was deemed a big performance problem. Since file systems > already issued and waited for all I/O before sending down a barrier, we > decided to get rid of the I/O ordering pieces of barriers (and stop > calling them barriers). > > See commit 28e7d184521 (block: drop barrier ordering by queue draining). Yes, I realise that, even if I wasn't very clear about how I wrote it. ;) Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO scheduler calls to elv_dispatch_sort() that don't pass REQ_SOFTBARRIER in the queue. IOWs, if we queue a bunch of REQ_WRITE IOs followed by a REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured by the cache flush. Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO to complete, we know that all the earlier REQ_WRITE IOs are on stable storage, too. Hence there's no need for the elevator to drain the queue to guarantee completion ordering - the dispatch ordering and flush/fua write semantics guarantee that when the flush/fua completes, all the IOs dispatch prior to that flush/fua write are also on stable storage... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Mon, 02 Nov 2015 16:02:48 -0500 Message-ID: References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> Mime-Version: 1.0 Content-Type: text/plain Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox , axboe@kernel.dk To: Dave Chinner Return-path: In-Reply-To: <20151102201029.GI10656@dastard> (Dave Chinner's message of "Tue, 3 Nov 2015 07:10:29 +1100") Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org Dave Chinner writes: > On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: >> Dave Chinner writes: >> >> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable >> > storage" commands. They are also IO barriers that affect scheduling >> > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA >> > IO cannot be dispatched before all prior IO has been dispatched and >> > drained from the request queue, and IO submitted after a queued >> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued >> > REQ_FLUSH/REQ_FUA operation. >> > >> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable >> > storage, they also guarantee the order of IO dispatch and >> > completion when concurrent IO is in progress. >> >> This hasn't been the case for several years, now. It used to work that >> way, and that was deemed a big performance problem. Since file systems >> already issued and waited for all I/O before sending down a barrier, we >> decided to get rid of the I/O ordering pieces of barriers (and stop >> calling them barriers). >> >> See commit 28e7d184521 (block: drop barrier ordering by queue draining). > > Yes, I realise that, even if I wasn't very clear about how I wrote > it. ;) > > Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO > barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH > -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO > scheduler calls to elv_dispatch_sort() that don't pass > REQ_SOFTBARRIER in the queue. This part is right. > IOWs, if we queue a bunch of REQ_WRITE IOs followed by a > REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be > dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured > by the cache flush. But this part is not. It is up to the I/O scheduler to decide when to dispatch requests. It can hold on to them for a variety of reasons. Flush requests, however, do not go through the I/O scheduler. At the very moment that the flush request is inserted, it goes directly to the dispatch queue (assuming no other flush is in progress). The prior requests may still be waiting in the I/O scheduler's internal lists. So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH. However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the queue. > Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO > to complete, we know that all the earlier REQ_WRITE IOs are on > stable storage, too. Hence there's no need for the elevator to drain > the queue to guarantee completion ordering - the dispatch ordering > and flush/fua write semantics guarantee that when the flush/fua > completes, all the IOs dispatch prior to that flush/fua write are > also on stable storage... Des xfs rely on this model for correctness? If so, I'd say we've got a problem. Cheers, Jeff From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Wed, 04 Nov 2015 13:34:58 -0500 Message-ID: References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> Mime-Version: 1.0 Content-Type: text/plain Cc: linux-nvdimm@ml01.01.org, "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , x86@kernel.org, Ingo Molnar , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , axboe@kernel.dk, "Theodore Ts'o" , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox To: Dave Chinner Return-path: In-Reply-To: (Jeff Moyer's message of "Mon, 02 Nov 2015 16:02:48 -0500") Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Jeff Moyer writes: >> Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO >> to complete, we know that all the earlier REQ_WRITE IOs are on >> stable storage, too. Hence there's no need for the elevator to drain >> the queue to guarantee completion ordering - the dispatch ordering >> and flush/fua write semantics guarantee that when the flush/fua >> completes, all the IOs dispatch prior to that flush/fua write are >> also on stable storage... > > Des xfs rely on this model for correctness? If so, I'd say we've got a > problem. Dave? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Thu, 5 Nov 2015 19:33:09 +1100 Message-ID: <20151105083309.GJ19199@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox , axboe@kernel.dk To: Jeff Moyer Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org [ sorry for slow response, been without an internet connection for ~36 hours ] On Mon, Nov 02, 2015 at 04:02:48PM -0500, Jeff Moyer wrote: > Dave Chinner writes: > > > On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: > >> Dave Chinner writes: > >> > >> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > >> > storage" commands. They are also IO barriers that affect scheduling > >> > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > >> > IO cannot be dispatched before all prior IO has been dispatched and > >> > drained from the request queue, and IO submitted after a queued > >> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > >> > REQ_FLUSH/REQ_FUA operation. > >> > > >> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > >> > storage, they also guarantee the order of IO dispatch and > >> > completion when concurrent IO is in progress. > >> > >> This hasn't been the case for several years, now. It used to work that > >> way, and that was deemed a big performance problem. Since file systems > >> already issued and waited for all I/O before sending down a barrier, we > >> decided to get rid of the I/O ordering pieces of barriers (and stop > >> calling them barriers). > >> > >> See commit 28e7d184521 (block: drop barrier ordering by queue draining). > > > > Yes, I realise that, even if I wasn't very clear about how I wrote > > it. ;) > > > > Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO > > barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH > > -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO > > scheduler calls to elv_dispatch_sort() that don't pass > > REQ_SOFTBARRIER in the queue. > > This part is right. > > > IOWs, if we queue a bunch of REQ_WRITE IOs followed by a > > REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be > > dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured > > by the cache flush. > > But this part is not. It is up to the I/O scheduler to decide when to > dispatch requests. It can hold on to them for a variety of reasons. > Flush requests, however, do not go through the I/O scheduler. At the That's pure REQ_FLUSH bios, right? Aren't data IOs with REQ_FLUSH|REQ_FUA sorted like any other IO? > very moment that the flush request is inserted, it goes directly to the > dispatch queue (assuming no other flush is in progress). The prior > requests may still be waiting in the I/O scheduler's internal lists. > > So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH. > However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the > queue. Uh, ok, that's different, and most definitely not the "IO barrier" I was under the impression REQ_FLUSH|REQ_FUA gave us. > > Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO > > to complete, we know that all the earlier REQ_WRITE IOs are on > > stable storage, too. Hence there's no need for the elevator to drain > > the queue to guarantee completion ordering - the dispatch ordering > > and flush/fua write semantics guarantee that when the flush/fua > > completes, all the IOs dispatch prior to that flush/fua write are > > also on stable storage... > > Des xfs rely on this model for correctness? If so, I'd say we've got a > problem No, it doesn't. The XFS integrity model doesn't trust the IO layers to tell the truth about IO ordering and completion or for it's developers to fully understand how IO layer ordering works. :P i.e. we wait for full completions of all dependent IO before issuing flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure the dependent IOs are fully caught by the cache flushes... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Thu, 05 Nov 2015 14:49:21 -0500 Message-ID: References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> <20151105083309.GJ19199@dastard> Mime-Version: 1.0 Content-Type: text/plain Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox , axboe@kernel.dk To: Dave Chinner Return-path: In-Reply-To: <20151105083309.GJ19199@dastard> (Dave Chinner's message of "Thu, 5 Nov 2015 19:33:09 +1100") Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org Dave Chinner writes: >> But this part is not. It is up to the I/O scheduler to decide when to >> dispatch requests. It can hold on to them for a variety of reasons. >> Flush requests, however, do not go through the I/O scheduler. At the > > That's pure REQ_FLUSH bios, right? Aren't data IOs with > REQ_FLUSH|REQ_FUA sorted like any other IO? No, they also go through the flush machinery, and so short-circuit the I/O scheduler. >> Des xfs rely on this model for correctness? If so, I'd say we've got a >> problem > > No, it doesn't. The XFS integrity model doesn't trust the IO layers > to tell the truth about IO ordering and completion or for it's > developers to fully understand how IO layer ordering works. :P > > i.e. we wait for full completions of all dependent IO before issuing > flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure > the dependent IOs are fully caught by the cache flushes... OK, phew! ;-) -Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: Re: [RFC 00/11] DAX fsynx/msync support Date: Thu, 5 Nov 2015 13:54:19 -0700 Message-ID: <563BC1FB.60004@kernel.dk> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> <20151105083309.GJ19199@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox To: Dave Chinner , Jeff Moyer Return-path: In-Reply-To: <20151105083309.GJ19199@dastard> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On 11/05/2015 01:33 AM, Dave Chinner wrote: >> Des xfs rely on this model for correctness? If so, I'd say we've got a >> problem > > No, it doesn't. The XFS integrity model doesn't trust the IO layers > to tell the truth about IO ordering and completion or for it's > developers to fully understand how IO layer ordering works. :P That's good, because the storage developers simplified the model so that fs developers would be able to get and use it. > i.e. we wait for full completions of all dependent IO before issuing > flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure > the dependent IOs are fully caught by the cache flushes... ... which is what you are supposed to do, that's how it works. -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [RFC 09/11] dax: add support for fsync/sync Date: Thu, 29 Oct 2015 14:12:13 -0600 Message-Id: <1446149535-16200-10-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox List-ID: To properly handle fsync/msync in an efficient way DAX needs to track dirty pages so it is able to flush them durably to media on demand. The tracking of dirty pages is done via the radix tree in struct address_space. This radix tree is already used by the page writeback infrastructure for tracking dirty pages associated with an open file, and it already has support for exceptional (non struct page*) entries. We build upon these features to add exceptional entries to the radix tree for DAX dirty PMD or PTE pages at fault time. When called as part of the msync/fsync flush path DAX queries the radix tree for dirty entries, flushing them and then marking the PTE or PMD page table entries as clean. The step of cleaning the PTE or PMD entries is necessary so that on subsequent writes to the same page we get a new write fault, allowing us to re-enter the DAX tag into the radix tree. Signed-off-by: Ross Zwisler --- fs/dax.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/dax.h | 1 + mm/huge_memory.c | 14 ++--- mm/page-writeback.c | 9 +++ 4 files changed, 172 insertions(+), 13 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 131fd35a..3b38aff 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -24,10 +24,13 @@ #include #include #include +#include #include +#include #include #include #include +#include /* * dax_clear_blocks() is called from within transaction context from XFS, @@ -287,6 +290,42 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, return 0; } +static int dax_dirty_pgoff(struct vm_area_struct *vma, + struct address_space *mapping, unsigned long pgoff, + bool pmd_entry) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + void *tag; + int error = 0; + + __mark_inode_dirty(file_inode(vma->vm_file), I_DIRTY_PAGES); + + spin_lock_irq(&mapping->tree_lock); + + tag = radix_tree_lookup(page_tree, pgoff); + if (tag) { + if (pmd_entry && tag == RADIX_TREE_DAX_PTE) { + radix_tree_delete(&mapping->page_tree, pgoff); + mapping->nrdax--; + } else + goto out; + } + + if (pmd_entry) + error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PMD); + else + error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PTE); + + if (error) + goto out; + + mapping->nrdax++; + radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY); + out: + spin_unlock_irq(&mapping->tree_lock); + return error; +} + static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -450,6 +489,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, delete_from_page_cache(page); unlock_page(page); page_cache_release(page); + page = NULL; } /* @@ -463,6 +503,13 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, * as for normal BH based IO completions. */ error = dax_insert_mapping(inode, &bh, vma, vmf); + if (error) + goto out; + + error = dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false); + if (error) + goto out; + if (buffer_unwritten(&bh)) { if (complete_unwritten) complete_unwritten(&bh, !error); @@ -537,7 +584,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, pgoff_t size, pgoff; sector_t block, sector; unsigned long pfn; - int result = 0; + int error, result = 0; /* Fall back to PTEs if we're going to COW */ if (write && !(vma->vm_flags & VM_SHARED)) @@ -638,6 +685,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, } result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); + + error = dax_dirty_pgoff(vma, inode->i_mapping, pgoff, true); + if (error) + goto fallback; } out: @@ -693,11 +744,11 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault); */ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) { - struct super_block *sb = file_inode(vma->vm_file)->i_sb; + struct file *file = vma->vm_file; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; - sb_start_pagefault(sb); - file_update_time(vma->vm_file); - sb_end_pagefault(sb); + dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false); return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); @@ -772,3 +823,103 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) return dax_zero_page_range(inode, from, length, get_block); } EXPORT_SYMBOL_GPL(dax_truncate_page); + +static int dax_flush_one_mapping(struct address_space *mapping, + struct inode *inode, sector_t block, void *tag) +{ + get_block_t *get_block = inode->i_op->get_block; + struct buffer_head bh; + void __pmem *addr; + int ret; + + BUG_ON(tag != RADIX_TREE_DAX_PMD && tag != RADIX_TREE_DAX_PTE); + + memset(&bh, 0, sizeof(bh)); + + if (tag == RADIX_TREE_DAX_PMD) + bh.b_size = PMD_SIZE; + else + bh.b_size = PAGE_SIZE; + + ret = get_block(inode, block, &bh, false); + BUG_ON(!buffer_written(&bh)); + if (ret < 0) + return ret; + + ret = dax_get_addr(&bh, &addr, inode->i_blkbits); + if (ret < 0) + return ret; + + if (tag == RADIX_TREE_DAX_PMD) + WARN_ON(ret != PMD_SIZE); + else + WARN_ON(ret != PAGE_SIZE); + + wb_cache_pmem(addr, ret); + + spin_lock_irq(&mapping->tree_lock); + radix_tree_delete(&mapping->page_tree, block); + spin_unlock_irq(&mapping->tree_lock); + mapping->nrdax--; + + return pgoff_mkclean(block, mapping); +} + +/* + * flush the mapping to the persistent domain within the byte range of (start, + * end). This is required by data integrity operations to ensure file data is on + * persistent storage prior to completion of the operation. It also requires us + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when + * the file is written to again so wehave an indication that we need to flush + * the mapping if a data integrity operation takes place. + * + * We don't need commits to storage here - the filesystems will issue flushes + * appropriately at the conclusion of the data integrity operation via REQ_FUA + * writes or blkdev_issue_flush() commands. This requires the DAX block device + * to implement persistent storage domain fencing/commits on receiving a + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher + * layers. + */ +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end) +{ + struct inode *inode = mapping->host; + pgoff_t indices[PAGEVEC_SIZE]; + struct pagevec pvec; + int i, error; + + pgoff_t start_page = start >> PAGE_CACHE_SHIFT; + pgoff_t end_page = end >> PAGE_CACHE_SHIFT; + + + if (mapping->nrdax == 0) + return 0; + + if (!inode->i_op->get_block) { + WARN_ONCE(1, "Flushing DAX mapping without get_block()!"); + mapping->nrdax = 0; + return 0; + } + + BUG_ON(inode->i_blkbits != PAGE_SHIFT); + + tag_pages_for_writeback(mapping, start_page, end_page); + + pagevec_init(&pvec, 0); + while (1) { + pvec.nr = find_get_entries_tag(mapping, start_page, + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, + pvec.pages, indices); + + if (pvec.nr == 0) + break; + + for (i = 0; i < pvec.nr; i++) { + error = dax_flush_one_mapping(mapping, inode, + indices[i], pvec.pages[i]); + if (error) + return error; + } + } + + return 0; +} diff --git a/include/linux/dax.h b/include/linux/dax.h index e9d57f68..5eff476 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -41,4 +41,5 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end); #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bbac913..1b3df56 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, spinlock_t *ptl; ptl = pmd_lock(mm, pmd); - if (pmd_none(*pmd)) { - entry = pmd_mkhuge(pfn_pmd(pfn, prot)); - if (write) { - entry = pmd_mkyoung(pmd_mkdirty(entry)); - entry = maybe_pmd_mkwrite(entry, vma); - } - set_pmd_at(mm, addr, pmd, entry); - update_mmu_cache_pmd(vma, addr, pmd); + entry = pmd_mkhuge(pfn_pmd(pfn, prot)); + if (write) { + entry = pmd_mkyoung(pmd_mkdirty(entry)); + entry = maybe_pmd_mkwrite(entry, vma); } + set_pmd_at(mm, addr, pmd, entry); + update_mmu_cache_pmd(vma, addr, pmd); spin_unlock(ptl); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2c90357..1801df8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include "internal.h" @@ -2338,6 +2339,14 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) if (wbc->nr_to_write <= 0) return 0; + + if (wbc->sync_mode == WB_SYNC_ALL && dax_mapping(mapping)) { + ret = dax_flush_mapping(mapping, wbc->range_start, + wbc->range_end); + if (ret) + return ret; + } + if (mapping->a_ops->writepages) ret = mapping->a_ops->writepages(mapping, wbc); else -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault Date: Thu, 29 Oct 2015 14:12:14 -0600 Message-Id: <1446149535-16200-11-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox List-ID: Previously the functionality offered by dax_pfn_mkwrite() was open-coded in XFS and ext2. With the addition of DAX msync/fsync support we need to call dax_pfn_mkwrite() so that the newly writeable page can be added to the radix tree and marked as dirty. Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 4 +++- fs/xfs/xfs_file.c | 9 +++++---- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index fc1418c..5741e29 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct ext2_inode_info *ei = EXT2_I(inode); - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(inode->i_sb); file_update_time(vma->vm_file); @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&ei->dax_sem); sb_end_pagefault(inode->i_sb); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f429662..584e4eb 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1575,9 +1575,8 @@ xfs_filemap_pmd_fault( /* * pfn_mkwrite was originally inteneded to ensure we capture time stamp * updates on write faults. In reality, it's need to serialise against - * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite() - * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault - * barrier in place. + * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED + * to ensure we serialise the fault barrier in place. */ static int xfs_filemap_pfn_mkwrite( @@ -1587,7 +1586,7 @@ xfs_filemap_pfn_mkwrite( struct inode *inode = file_inode(vma->vm_file); struct xfs_inode *ip = XFS_I(inode); - int ret = VM_FAULT_NOPAGE; + int ret; loff_t size; trace_xfs_filemap_pfn_mkwrite(ip); @@ -1600,6 +1599,8 @@ xfs_filemap_pfn_mkwrite( size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); sb_end_pagefault(inode->i_sb); return ret; -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [RFC 05/11] mm: add follow_pte_pmd() Date: Thu, 29 Oct 2015 14:12:09 -0600 Message-Id: <1446149535-16200-6-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox List-ID: Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a huge page PMD leaf to be found and returned. Signed-off-by: Ross Zwisler --- include/linux/mm.h | 2 ++ mm/memory.c | 41 +++++++++++++++++++++++++++++++++-------- 2 files changed, 35 insertions(+), 8 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 80001de..393441c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1166,6 +1166,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); +int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index deb679c..c2b8c0a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3512,8 +3512,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) } #endif /* __PAGETABLE_PMD_FOLDED */ -static int __follow_pte(struct mm_struct *mm, unsigned long address, - pte_t **ptepp, spinlock_t **ptlp) +static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { pgd_t *pgd; pud_t *pud; @@ -3529,12 +3529,23 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address, goto out; pmd = pmd_offset(pud, address); - VM_BUG_ON(pmd_trans_huge(*pmd)); - if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) - goto out; - /* We cannot handle huge page PFN maps. Luckily they don't exist. */ - if (pmd_huge(*pmd)) + if (pmd_huge(*pmd)) { + if (!pmdpp) + goto out; + + *ptlp = pmd_lock(mm, pmd); + if (pmd_huge(*pmd)) { + /* Success, we found a large PTE */ + *pmdpp = pmd; + return 0; + } + /* Somebody removed the PMD entry, try it as a pte */ + spin_unlock(*ptlp); + } + + /* FIXME: pmd_bad() is sometimes set for DAX pmds? */ + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) goto out; ptep = pte_offset_map_lock(mm, pmd, address, ptlp); @@ -3557,9 +3568,23 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte(mm, address, ptepp, ptlp))); + !(res = __follow_pte_pmd(mm, address, ptepp, NULL, + ptlp))); + return res; +} + +int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) +{ + int res; + + /* (void) is needed to make gcc happy */ + (void) __cond_lock(*ptlp, + !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp, + ptlp))); return res; } +EXPORT_SYMBOL(follow_pte_pmd); /** * follow_pfn - look up PFN at a user virtual address -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <20151030194300.GA22670@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030194300.GA22670@linux.intel.com> Date: Fri, 30 Oct 2015 12:51:40 -0700 Message-ID: Subject: Re: [RFC 00/11] DAX fsynx/msync support From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox List-ID: On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler wrote: > On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: >> On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler >> wrote: >> > This patch series adds support for fsync/msync to DAX. >> > >> > Patches 1 through 8 add various utilities that the DAX code will eventually >> > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are >> > filesystem changes that are needed after the DAX code is added, but these >> > patches may change slightly as the filesystem fault handling for DAX is >> > being modified ([1] and [2]). >> > >> > I've marked this series as RFC because I'm still testing, but I wanted to >> > get this out there so people would see the direction I was going and >> > hopefully comment on any big red flags sooner rather than later. >> > >> > I realize that we are getting pretty dang close to the v4.4 merge window, >> > but I think that if we can get this reviewed and working it's a much better >> > solution than the "big hammer" approach that blindly flushes entire PMEM >> > namespaces [3]. >> > >> > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html >> > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 >> > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html >> > >> > Ross Zwisler (11): >> > pmem: add wb_cache_pmem() to the PMEM API >> > mm: add pmd_mkclean() >> > pmem: enable REQ_FLUSH handling >> > dax: support dirty DAX entries in radix tree >> > mm: add follow_pte_pmd() >> > mm: add pgoff_mkclean() >> > mm: add find_get_entries_tag() >> > fs: add get_block() to struct inode_operations >> > dax: add support for fsync/sync >> > xfs, ext2: call dax_pfn_mkwrite() on write fault >> > ext4: add ext4_dax_pfn_mkwrite() >> >> This is great to have when the flush-the-world solution ends up >> killing performance. However, there are a couple mitigating options >> for workloads that dirty small amounts and flush often that we need to >> collect data on: >> >> 1/ Using cache management and pcommit from userspace to skip calls to >> msync / fsync. Although, this does not eliminate all calls to >> blkdev_issue_flush as the fs may invoke it for other reasons. I >> suspect turning on REQ_FUA support eliminates a number of those >> invocations, and pmem already satisfies REQ_FUA semantics by default. > > Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM > already handles the requirements of REQ_FUA, but I didn't realize that it > might reduce the number of REQ_FLUSH bios we receive. I'll let Dave chime in, but a lot of the flush requirements come from guaranteeing the state of the metadata, if metadata updates can be done with REQ_FUA then there is no subsequent need to flush. >> 2/ Turn off DAX and use the page cache. As Dave mentions [1] we >> should enable this control on a per-inode basis. I'm folding in this >> capability as a blkdev_ioctl for the next version of the raw block DAX >> support patch. > > Umm...I think you just said "the way to avoid this delay is to just not use > DAX". :) I don't think this is where we want to go - we are trying to make > DAX better, not abandon it. That's a bit of an exaggeration. Avoiding DAX where it is not necessary is not "abandoning DAX", it's using the right tool for the job. Page cache is fine for many cases. >> It's entirely possible these mitigations won't eliminate the need for >> a mechanism like this, but I think we have a bit more work to do to >> find out how bad this is in practice as well as the crossover point >> where walking the radix becomes prohibitive. > > I'm guessing a single run through xfstests will be enough to convince you that > the "big hammer" approach is untenable. Tests that used to take a second now > take several minutes, at least in my VM testing environment... And that's > only using a tiny 4GiB namespace. > > Yes, we can distribute the cost over multiple CPUs, but that just distributes > the problem and doesn't reduce the overall work that needs to be done. > Ultimately I think that looping through multiple GiB or even TiB of cache > lines and blindly writing them back individually on every REQ_FLUSH is going > to be a deal breaker. Right, part of the problem is that the driver doesn't know which blocks are actively DAX mapped. I think we can incrementally fix that without requiring DAX specific fsync/msync handling code for each fs that supports DAX. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 29 Oct 2015 16:49:53 -0600 From: Ross Zwisler Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151029224953.GA17933@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler , Dave Chinner Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox List-ID: On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html Hmm...I think I may need to isolate the fsync/msync flushing against races with truncate since we are calling into the filesystem directly with get_block(). Dave (Chinner), does this sound right? Also, one thing I forgot to mention is that these patches are built upon the first version of Dave Chinner's XFS patches and my ext2 patches that deal with the truncate races with DAX. A snapshot of my development tree with these patches applied can be found here: https://github.com/01org/prd/tree/fsync_rfc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id EBF697CBF for ; Thu, 29 Oct 2015 15:12:31 -0500 (CDT) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id DBA588F8049 for ; Thu, 29 Oct 2015 13:12:31 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id dL7R1tmoTZbvxvof for ; Thu, 29 Oct 2015 13:12:30 -0700 (PDT) From: Ross Zwisler Subject: [RFC 02/11] mm: add pmd_mkclean() Date: Thu, 29 Oct 2015 14:12:06 -0600 Message-Id: <1446149535-16200-3-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Currently PMD pages can be dirtied via pmd_mkdirty(), but cannot be cleaned. For DAX mmap dirty page tracking we need to be able to clean PMD pages when we flush them to media so that we get a new write fault the next time the are written to. Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pgtable.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 867da5b..c548e4c 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -277,6 +277,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } +static inline pmd_t pmd_mkclean(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); +} + static inline pmd_t pmd_mkhuge(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_PSE); -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id DF8907CBF for ; Thu, 29 Oct 2015 15:12:32 -0500 (CDT) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id CFEC130404E for ; Thu, 29 Oct 2015 13:12:32 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id 3CAC24cb9E0JNJWp for ; Thu, 29 Oct 2015 13:12:31 -0700 (PDT) From: Ross Zwisler Subject: [RFC 03/11] pmem: enable REQ_FLUSH handling Date: Thu, 29 Oct 2015 14:12:07 -0600 Message-Id: <1446149535-16200-4-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Currently the PMEM driver doesn't accept REQ_FLUSH bios. These are sent down via blkdev_issue_flush() in response to a fsync() or msync(). When we get an msync() or fsync() it is the responsibility of the DAX code to flush all dirty pages to media. The PMEM driver then just has issue a wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all the flushed data has been durably stored on the media. Signed-off-by: Ross Zwisler --- drivers/nvdimm/pmem.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 0ba6a97..e1e222e 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -80,7 +80,7 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio) if (do_acct) nd_iostat_end(bio, start); - if (bio_data_dir(bio)) + if (bio_data_dir(bio) || (bio->bi_rw & REQ_FLUSH)) wmb_pmem(); bio_endio(bio); @@ -189,6 +189,7 @@ static int pmem_attach_disk(struct device *dev, blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE); blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX); blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); + blk_queue_flush(pmem->pmem_queue, REQ_FLUSH); queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue); disk = alloc_disk(0); -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 220237F3F for ; Thu, 29 Oct 2015 15:12:33 -0500 (CDT) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 1344B30404E for ; Thu, 29 Oct 2015 13:12:29 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id ouHuDNMaT7wwrXxX for ; Thu, 29 Oct 2015 13:12:28 -0700 (PDT) From: Ross Zwisler Subject: [RFC 00/11] DAX fsynx/msync support Date: Thu, 29 Oct 2015 14:12:04 -0600 Message-Id: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox This patch series adds support for fsync/msync to DAX. Patches 1 through 8 add various utilities that the DAX code will eventually need, and the DAX code itself is added by patch 9. Patches 10 and 11 are filesystem changes that are needed after the DAX code is added, but these patches may change slightly as the filesystem fault handling for DAX is being modified ([1] and [2]). I've marked this series as RFC because I'm still testing, but I wanted to get this out there so people would see the direction I was going and hopefully comment on any big red flags sooner rather than later. I realize that we are getting pretty dang close to the v4.4 merge window, but I think that if we can get this reviewed and working it's a much better solution than the "big hammer" approach that blindly flushes entire PMEM namespaces [3]. [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html Ross Zwisler (11): pmem: add wb_cache_pmem() to the PMEM API mm: add pmd_mkclean() pmem: enable REQ_FLUSH handling dax: support dirty DAX entries in radix tree mm: add follow_pte_pmd() mm: add pgoff_mkclean() mm: add find_get_entries_tag() fs: add get_block() to struct inode_operations dax: add support for fsync/sync xfs, ext2: call dax_pfn_mkwrite() on write fault ext4: add ext4_dax_pfn_mkwrite() arch/x86/include/asm/pgtable.h | 5 ++ arch/x86/include/asm/pmem.h | 11 +-- drivers/nvdimm/pmem.c | 3 +- fs/dax.c | 161 +++++++++++++++++++++++++++++++++++++++-- fs/ext2/file.c | 5 +- fs/ext4/file.c | 23 +++++- fs/inode.c | 1 + fs/xfs/xfs_file.c | 9 ++- fs/xfs/xfs_iops.c | 1 + include/linux/dax.h | 6 ++ include/linux/fs.h | 5 +- include/linux/mm.h | 2 + include/linux/pagemap.h | 3 + include/linux/pmem.h | 22 +++++- include/linux/radix-tree.h | 3 + include/linux/rmap.h | 5 ++ mm/filemap.c | 73 ++++++++++++++++++- mm/huge_memory.c | 14 ++-- mm/memory.c | 41 +++++++++-- mm/page-writeback.c | 9 +++ mm/rmap.c | 53 ++++++++++++++ mm/truncate.c | 5 +- 22 files changed, 418 insertions(+), 42 deletions(-) -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 996997F3F for ; Thu, 29 Oct 2015 15:12:34 -0500 (CDT) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id 8A4F78F804C for ; Thu, 29 Oct 2015 13:12:31 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id p6hl9FycSJHYkd5x for ; Thu, 29 Oct 2015 13:12:29 -0700 (PDT) From: Ross Zwisler Subject: [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API Date: Thu, 29 Oct 2015 14:12:05 -0600 Message-Id: <1446149535-16200-2-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox The function __arch_wb_cache_pmem() was already an internal implementation detail of the x86 PMEM API, but this functionality needs to be exported as part of the general PMEM API to handle the fsync/msync case for DAX mmaps. Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pmem.h | 11 ++++++----- include/linux/pmem.h | 22 +++++++++++++++++++++- 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h index d8ce3ec..6c7ade0 100644 --- a/arch/x86/include/asm/pmem.h +++ b/arch/x86/include/asm/pmem.h @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void) } /** - * __arch_wb_cache_pmem - write back a cache range with CLWB + * arch_wb_cache_pmem - write back a cache range with CLWB * @vaddr: virtual start address * @size: number of bytes to write back * * Write back a cache range using the CLWB (cache line write back) * instruction. This function requires explicit ordering with an - * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation. + * arch_wmb_pmem() call. */ -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size) +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) { u16 x86_clflush_size = boot_cpu_data.x86_clflush_size; unsigned long clflush_mask = x86_clflush_size - 1; + void *vaddr = (void __force *)addr; void *vend = vaddr + size; void *p; @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes, len = copy_from_iter_nocache(vaddr, bytes, i); if (__iter_needs_pmem_wb(i)) - __arch_wb_cache_pmem(vaddr, bytes); + arch_wb_cache_pmem(addr, bytes); return len; } @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) else memset(vaddr, 0, size); - __arch_wb_cache_pmem(vaddr, size); + arch_wb_cache_pmem(addr, size); } static inline bool __arch_has_wmb_pmem(void) diff --git a/include/linux/pmem.h b/include/linux/pmem.h index 85f810b3..2cd5003 100644 --- a/include/linux/pmem.h +++ b/include/linux/pmem.h @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) { BUG(); } + +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) +{ + BUG(); +} #endif /* * Architectures that define ARCH_HAS_PMEM_API must provide * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(), - * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem(). + * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem() + * and arch_has_wmb_pmem(). */ static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size) { @@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size) else default_clear_pmem(addr, size); } + +/** + * wb_cache_pmem - write back processor cache for PMEM memory range + * @addr: virtual start address + * @size: number of bytes to write back + * + * Write back the processor cache range starting at 'addr' for 'size' bytes. + * This function requires explicit ordering with a wmb_pmem() call. + */ +static inline void wb_cache_pmem(void __pmem *addr, size_t size) +{ + if (arch_has_pmem_api()) + arch_wb_cache_pmem(addr, size); +} #endif /* __PMEM_H__ */ -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id C140B7F4E for ; Thu, 29 Oct 2015 15:12:34 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay1.corp.sgi.com (Postfix) with ESMTP id 82FA78F804B for ; Thu, 29 Oct 2015 13:12:34 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id QyrThUPtrPOAD5Pi for ; Thu, 29 Oct 2015 13:12:32 -0700 (PDT) From: Ross Zwisler Subject: [RFC 04/11] dax: support dirty DAX entries in radix tree Date: Thu, 29 Oct 2015 14:12:08 -0600 Message-Id: <1446149535-16200-5-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Add support for tracking dirty DAX entries in the struct address_space radix tree. This tree is already used for dirty page writeback, and it already supports the use of exceptional (non struct page*) entries. In order to properly track dirty DAX pages we will insert new exceptional entries into the radix tree that represent dirty DAX PTE or PMD pages. There are currently two types of exceptional entries (shmem and shadow) that can be placed into the radix tree, and this adds a third. There shouldn't be any collisions between these various exceptional entries because only one type of exceptional entry should be able to be found in a radix tree at a time depending on how it is being used. Signed-off-by: Ross Zwisler --- fs/inode.c | 1 + include/linux/dax.h | 5 +++++ include/linux/fs.h | 1 + include/linux/radix-tree.h | 3 +++ mm/filemap.c | 12 ++++++++---- mm/truncate.c | 5 +++-- 6 files changed, 21 insertions(+), 6 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 78a17b8..f7c87a6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) spin_lock_irq(&inode->i_data.tree_lock); BUG_ON(inode->i_data.nrpages); BUG_ON(inode->i_data.nrshadows); + BUG_ON(inode->i_data.nrdax); spin_unlock_irq(&inode->i_data.tree_lock); BUG_ON(!list_empty(&inode->i_data.private_list)); BUG_ON(!(inode->i_state & I_FREEING)); diff --git a/include/linux/dax.h b/include/linux/dax.h index b415e52..e9d57f68 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) { return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); } + +static inline bool dax_mapping(struct address_space *mapping) +{ + return mapping->host && IS_DAX(mapping->host); +} #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index 72d8a84..f791698 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -433,6 +433,7 @@ struct address_space { /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ unsigned long nrshadows; /* number of shadow entries */ + unsigned long nrdax; /* number of DAX entries */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ unsigned long flags; /* error bits/gfp mask */ diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 33170db..fabec66 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -51,6 +51,9 @@ #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 +#define RADIX_TREE_DAX_PTE ((void *)(0x10 | RADIX_TREE_EXCEPTIONAL_ENTRY)) +#define RADIX_TREE_DAX_PMD ((void *)(0x20 | RADIX_TREE_EXCEPTIONAL_ENTRY)) + static inline int radix_tree_is_indirect_ptr(void *ptr) { return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); diff --git a/mm/filemap.c b/mm/filemap.c index 327910c..c3a9e4f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -11,6 +11,7 @@ */ #include #include +#include #include #include #include @@ -440,7 +441,7 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (mapping->nrpages) { + if (mapping->nrpages || mapping->nrdax) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); /* See comment of filemap_write_and_wait() */ @@ -538,6 +539,9 @@ static int page_cache_tree_insert(struct address_space *mapping, p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); if (!radix_tree_exceptional_entry(p)) return -EEXIST; + + BUG_ON(dax_mapping(mapping)); + if (shadowp) *shadowp = p; mapping->nrshadows--; @@ -1201,9 +1205,9 @@ repeat: if (radix_tree_deref_retry(page)) goto restart; /* - * A shadow entry of a recently evicted page, - * or a swap entry from shmem/tmpfs. Return - * it without attempting to raise page count. + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. */ goto export; } diff --git a/mm/truncate.c b/mm/truncate.c index 76e35ad..cdf44a0 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -9,6 +9,7 @@ #include #include +#include #include #include #include @@ -29,8 +30,8 @@ static void clear_exceptional_entry(struct address_space *mapping, struct radix_tree_node *node; void **slot; - /* Handled by shmem itself */ - if (shmem_mapping(mapping)) + /* Handled by shmem or DAX directly */ + if (shmem_mapping(mapping) || dax_mapping(mapping)) return; spin_lock_irq(&mapping->tree_lock); -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 6E9007F53 for ; Thu, 29 Oct 2015 15:12:36 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay2.corp.sgi.com (Postfix) with ESMTP id 323C3304053 for ; Thu, 29 Oct 2015 13:12:36 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id PxN4OrpBzM9wdwZt for ; Thu, 29 Oct 2015 13:12:34 -0700 (PDT) From: Ross Zwisler Subject: [RFC 06/11] mm: add pgoff_mkclean() Date: Thu, 29 Oct 2015 14:12:10 -0600 Message-Id: <1446149535-16200-7-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Introduce pgoff_mkclean() which conceptually is similar to page_mkclean() except it works in the absence of struct page and it can also be used to clean PMDs. This is needed for DAX's dirty page handling. Signed-off-by: Ross Zwisler --- include/linux/rmap.h | 5 +++++ mm/rmap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 58 insertions(+) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 29446ae..627875f9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -223,6 +223,11 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); int page_mkclean(struct page *); /* + * Cleans and write protects the PTEs of shared mappings. + */ +int pgoff_mkclean(pgoff_t, struct address_space *); + +/* * called in munlock()/munmap() path to check for other vmas holding * the page mlocked. */ diff --git a/mm/rmap.c b/mm/rmap.c index f5b5c1f..0ce16ab 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -586,6 +586,16 @@ vma_address(struct page *page, struct vm_area_struct *vma) return address; } +static inline unsigned long +pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma) +{ + unsigned long address; + + address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); + return address; +} + #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH static void percpu_flush_tlb_batch_pages(void *data) { @@ -1040,6 +1050,49 @@ int page_mkclean(struct page *page) } EXPORT_SYMBOL_GPL(page_mkclean); +int pgoff_mkclean(pgoff_t pgoff, struct address_space *mapping) +{ + struct vm_area_struct *vma; + int ret = 0; + + i_mmap_lock_read(mapping); + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + struct mm_struct *mm = vma->vm_mm; + pmd_t pmd, *pmdp = NULL; + pte_t pte, *ptep = NULL; + unsigned long address; + spinlock_t *ptl; + + address = pgoff_address(pgoff, vma); + + ret = follow_pte_pmd(mm, address, &ptep, &pmdp, &ptl); + if (ret) + goto out; + + if (pmdp) { + flush_cache_page(vma, address, pmd_pfn(*pmdp)); + pmd = pmdp_huge_clear_flush(vma, address, pmdp); + pmd = pmd_wrprotect(pmd); + pmd = pmd_mkclean(pmd); + set_pmd_at(mm, address, pmdp, pmd); + spin_unlock(ptl); + } else { + BUG_ON(!ptep); + flush_cache_page(vma, address, pte_pfn(*ptep)); + pte = ptep_clear_flush(vma, address, ptep); + pte = pte_wrprotect(pte); + pte = pte_mkclean(pte); + set_pte_at(mm, address, ptep, pte); + pte_unmap_unlock(ptep, ptl); + } + } + + out: + i_mmap_unlock_read(mapping); + return ret; +} +EXPORT_SYMBOL_GPL(pgoff_mkclean); + /** * page_move_anon_rmap - move a page to our anon_vma * @page: the page to move to our anon_vma -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 06ED07F4E for ; Thu, 29 Oct 2015 15:12:37 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay2.corp.sgi.com (Postfix) with ESMTP id C488830404E for ; Thu, 29 Oct 2015 13:12:36 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id kZJpRtpbdDDgEWIe for ; Thu, 29 Oct 2015 13:12:35 -0700 (PDT) From: Ross Zwisler Subject: [RFC 07/11] mm: add find_get_entries_tag() Date: Thu, 29 Oct 2015 14:12:11 -0600 Message-Id: <1446149535-16200-8-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Add find_get_entries_tag() to the family of functions that include find_get_entries(), find_get_pages() and find_get_pages_tag(). This is needed for DAX dirty page handling because we need a list of both page offsets and radix tree entries ('indices' and 'entries' in this function) that are marked with the PAGECACHE_TAG_TOWRITE tag. Signed-off-by: Ross Zwisler --- include/linux/pagemap.h | 3 +++ mm/filemap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 64 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index a6c78e0..6fea3be 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -354,6 +354,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages); unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, int tag, unsigned int nr_pages, struct page **pages); +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices); struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags); diff --git a/mm/filemap.c b/mm/filemap.c index c3a9e4f..992cf84 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1454,6 +1454,67 @@ repeat: } EXPORT_SYMBOL(find_get_pages_tag); +/** + * find_get_entries_tag - find and return entries that match @tag + * @mapping: the address_space to search + * @start: the starting page cache index + * @tag: the tag index + * @nr_entries: the maximum number of entries + * @entries: where the resulting entries are placed + * @indices: the cache indices corresponding to the entries in @entries + * + * Like find_get_entries, except we only return entries which are tagged with + * @tag. + */ +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices) +{ + void **slot; + unsigned int ret = 0; + struct radix_tree_iter iter; + + if (!nr_entries) + return 0; + + rcu_read_lock(); +restart: + radix_tree_for_each_tagged(slot, &mapping->page_tree, + &iter, start, tag) { + struct page *page; +repeat: + page = radix_tree_deref_slot(slot); + if (unlikely(!page)) + continue; + if (radix_tree_exception(page)) { + if (radix_tree_deref_retry(page)) + goto restart; + /* + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. + */ + goto export; + } + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? */ + if (unlikely(page != *slot)) { + page_cache_release(page); + goto repeat; + } +export: + indices[ret] = iter.index; + entries[ret] = page; + if (++ret == nr_entries) + break; + } + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(find_get_entries_tag); + /* * CD/DVDs are error prone. When a medium error occurs, the driver may fail * a _large_ part of the i/o request. Imagine the worst scenario: -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 08EE47CBF for ; Thu, 29 Oct 2015 15:12:38 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id 8C9F8AC003 for ; Thu, 29 Oct 2015 13:12:37 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id JGC5Q0L5sdOVFkYR for ; Thu, 29 Oct 2015 13:12:36 -0700 (PDT) From: Ross Zwisler Subject: [RFC 08/11] fs: add get_block() to struct inode_operations Date: Thu, 29 Oct 2015 14:12:12 -0600 Message-Id: <1446149535-16200-9-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox To be able to flush dirty pages to media as part of the fsync/msync path DAX needs to be able to map file offsets to kernel addresses via a combination of the filesystem's get_block() routine and bdev_direct_access(). This currently happens in the DAX fault handlers which receive a get_block() callback directly from the filesystem via a function parameter. For the fsync/msync path this doesn't work, though, because DAX is called not by the filesystem but by the writeback infrastructure which doesn't know about the filesystem specific get_block() routine. To handle this we make get_block() an entry in the struct inode_operations table so that we can access the correct get_block() routine in the context of the writeback infrastructure. Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 1 + fs/ext4/file.c | 1 + fs/xfs/xfs_iops.c | 1 + include/linux/fs.h | 4 ++-- 4 files changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 11a42c5..fc1418c 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -202,4 +202,5 @@ const struct inode_operations ext2_file_inode_operations = { .get_acl = ext2_get_acl, .set_acl = ext2_set_acl, .fiemap = ext2_fiemap, + .get_block = ext2_get_block, }; diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 113837e..54d7729 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -720,5 +720,6 @@ const struct inode_operations ext4_file_inode_operations = { .get_acl = ext4_get_acl, .set_acl = ext4_set_acl, .fiemap = ext4_fiemap, + .get_block = ext4_get_block_dax, }; diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 8294132..c58c270 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1112,6 +1112,7 @@ static const struct inode_operations xfs_inode_operations = { .listxattr = xfs_vn_listxattr, .fiemap = xfs_vn_fiemap, .update_time = xfs_vn_update_time, + .get_block = xfs_get_blocks_direct, }; static const struct inode_operations xfs_dir_inode_operations = { diff --git a/include/linux/fs.h b/include/linux/fs.h index f791698..1dca85b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1679,8 +1679,8 @@ struct inode_operations { umode_t create_mode, int *opened); int (*tmpfile) (struct inode *, struct dentry *, umode_t); int (*set_acl)(struct inode *, struct posix_acl *, int); - - /* WARNING: probably going away soon, do not use! */ + int (*get_block)(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create); } ____cacheline_aligned; ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 23CDC7F3F for ; Thu, 29 Oct 2015 15:12:41 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id A0508AC002 for ; Thu, 29 Oct 2015 13:12:40 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id lIMQS4vjGB6Qlrj5 for ; Thu, 29 Oct 2015 13:12:38 -0700 (PDT) From: Ross Zwisler Subject: [RFC 11/11] ext4: add ext4_dax_pfn_mkwrite() Date: Thu, 29 Oct 2015 14:12:15 -0600 Message-Id: <1446149535-16200-12-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Add ext4_dax_pfn_mkwrite() that properly protects against freeze, sets the file update time and calls into dax_pfn_mkwrite() to add the newly dirtied page to the radix tree which tracks dirty DAX pages. Signed-off-by: Ross Zwisler --- fs/ext4/file.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 54d7729..7f62cad 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -272,11 +272,31 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) ext4_end_io_unwritten); } +static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct inode *inode = file_inode(vma->vm_file); + loff_t size; + int ret; + + sb_start_pagefault(inode->i_sb); + file_update_time(vma->vm_file); + + /* check that the faulting page hasn't raced with truncate */ + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; + if (vmf->pgoff >= size) + ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); + + sb_end_pagefault(inode->i_sb); + return ret; +} + static const struct vm_operations_struct ext4_dax_vm_ops = { .fault = ext4_dax_fault, .pmd_fault = ext4_dax_pmd_fault, .page_mkwrite = ext4_dax_mkwrite, - .pfn_mkwrite = dax_pfn_mkwrite, + .pfn_mkwrite = ext4_dax_pfn_mkwrite, }; #else #define ext4_dax_vm_ops ext4_file_vm_ops -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 93C6C7F50 for ; Thu, 29 Oct 2015 15:12:37 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id 39D90AC002 for ; Thu, 29 Oct 2015 13:12:34 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id ATDmGVXsJN0tIfbS for ; Thu, 29 Oct 2015 13:12:33 -0700 (PDT) From: Ross Zwisler Subject: [RFC 05/11] mm: add follow_pte_pmd() Date: Thu, 29 Oct 2015 14:12:09 -0600 Message-Id: <1446149535-16200-6-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a huge page PMD leaf to be found and returned. Signed-off-by: Ross Zwisler --- include/linux/mm.h | 2 ++ mm/memory.c | 41 +++++++++++++++++++++++++++++++++-------- 2 files changed, 35 insertions(+), 8 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 80001de..393441c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1166,6 +1166,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); +int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index deb679c..c2b8c0a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3512,8 +3512,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) } #endif /* __PAGETABLE_PMD_FOLDED */ -static int __follow_pte(struct mm_struct *mm, unsigned long address, - pte_t **ptepp, spinlock_t **ptlp) +static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { pgd_t *pgd; pud_t *pud; @@ -3529,12 +3529,23 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address, goto out; pmd = pmd_offset(pud, address); - VM_BUG_ON(pmd_trans_huge(*pmd)); - if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) - goto out; - /* We cannot handle huge page PFN maps. Luckily they don't exist. */ - if (pmd_huge(*pmd)) + if (pmd_huge(*pmd)) { + if (!pmdpp) + goto out; + + *ptlp = pmd_lock(mm, pmd); + if (pmd_huge(*pmd)) { + /* Success, we found a large PTE */ + *pmdpp = pmd; + return 0; + } + /* Somebody removed the PMD entry, try it as a pte */ + spin_unlock(*ptlp); + } + + /* FIXME: pmd_bad() is sometimes set for DAX pmds? */ + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) goto out; ptep = pte_offset_map_lock(mm, pmd, address, ptlp); @@ -3557,9 +3568,23 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte(mm, address, ptepp, ptlp))); + !(res = __follow_pte_pmd(mm, address, ptepp, NULL, + ptlp))); + return res; +} + +int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) +{ + int res; + + /* (void) is needed to make gcc happy */ + (void) __cond_lock(*ptlp, + !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp, + ptlp))); return res; } +EXPORT_SYMBOL(follow_pte_pmd); /** * follow_pfn - look up PFN at a user virtual address -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 9F8907F51 for ; Thu, 29 Oct 2015 15:12:39 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay2.corp.sgi.com (Postfix) with ESMTP id 63F1330404E for ; Thu, 29 Oct 2015 13:12:39 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id KeH8fVgbN1C58eTq for ; Thu, 29 Oct 2015 13:12:37 -0700 (PDT) From: Ross Zwisler Subject: [RFC 09/11] dax: add support for fsync/sync Date: Thu, 29 Oct 2015 14:12:13 -0600 Message-Id: <1446149535-16200-10-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox To properly handle fsync/msync in an efficient way DAX needs to track dirty pages so it is able to flush them durably to media on demand. The tracking of dirty pages is done via the radix tree in struct address_space. This radix tree is already used by the page writeback infrastructure for tracking dirty pages associated with an open file, and it already has support for exceptional (non struct page*) entries. We build upon these features to add exceptional entries to the radix tree for DAX dirty PMD or PTE pages at fault time. When called as part of the msync/fsync flush path DAX queries the radix tree for dirty entries, flushing them and then marking the PTE or PMD page table entries as clean. The step of cleaning the PTE or PMD entries is necessary so that on subsequent writes to the same page we get a new write fault, allowing us to re-enter the DAX tag into the radix tree. Signed-off-by: Ross Zwisler --- fs/dax.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/dax.h | 1 + mm/huge_memory.c | 14 ++--- mm/page-writeback.c | 9 +++ 4 files changed, 172 insertions(+), 13 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 131fd35a..3b38aff 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -24,10 +24,13 @@ #include #include #include +#include #include +#include #include #include #include +#include /* * dax_clear_blocks() is called from within transaction context from XFS, @@ -287,6 +290,42 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, return 0; } +static int dax_dirty_pgoff(struct vm_area_struct *vma, + struct address_space *mapping, unsigned long pgoff, + bool pmd_entry) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + void *tag; + int error = 0; + + __mark_inode_dirty(file_inode(vma->vm_file), I_DIRTY_PAGES); + + spin_lock_irq(&mapping->tree_lock); + + tag = radix_tree_lookup(page_tree, pgoff); + if (tag) { + if (pmd_entry && tag == RADIX_TREE_DAX_PTE) { + radix_tree_delete(&mapping->page_tree, pgoff); + mapping->nrdax--; + } else + goto out; + } + + if (pmd_entry) + error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PMD); + else + error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PTE); + + if (error) + goto out; + + mapping->nrdax++; + radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY); + out: + spin_unlock_irq(&mapping->tree_lock); + return error; +} + static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -450,6 +489,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, delete_from_page_cache(page); unlock_page(page); page_cache_release(page); + page = NULL; } /* @@ -463,6 +503,13 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, * as for normal BH based IO completions. */ error = dax_insert_mapping(inode, &bh, vma, vmf); + if (error) + goto out; + + error = dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false); + if (error) + goto out; + if (buffer_unwritten(&bh)) { if (complete_unwritten) complete_unwritten(&bh, !error); @@ -537,7 +584,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, pgoff_t size, pgoff; sector_t block, sector; unsigned long pfn; - int result = 0; + int error, result = 0; /* Fall back to PTEs if we're going to COW */ if (write && !(vma->vm_flags & VM_SHARED)) @@ -638,6 +685,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, } result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); + + error = dax_dirty_pgoff(vma, inode->i_mapping, pgoff, true); + if (error) + goto fallback; } out: @@ -693,11 +744,11 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault); */ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) { - struct super_block *sb = file_inode(vma->vm_file)->i_sb; + struct file *file = vma->vm_file; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; - sb_start_pagefault(sb); - file_update_time(vma->vm_file); - sb_end_pagefault(sb); + dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false); return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); @@ -772,3 +823,103 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) return dax_zero_page_range(inode, from, length, get_block); } EXPORT_SYMBOL_GPL(dax_truncate_page); + +static int dax_flush_one_mapping(struct address_space *mapping, + struct inode *inode, sector_t block, void *tag) +{ + get_block_t *get_block = inode->i_op->get_block; + struct buffer_head bh; + void __pmem *addr; + int ret; + + BUG_ON(tag != RADIX_TREE_DAX_PMD && tag != RADIX_TREE_DAX_PTE); + + memset(&bh, 0, sizeof(bh)); + + if (tag == RADIX_TREE_DAX_PMD) + bh.b_size = PMD_SIZE; + else + bh.b_size = PAGE_SIZE; + + ret = get_block(inode, block, &bh, false); + BUG_ON(!buffer_written(&bh)); + if (ret < 0) + return ret; + + ret = dax_get_addr(&bh, &addr, inode->i_blkbits); + if (ret < 0) + return ret; + + if (tag == RADIX_TREE_DAX_PMD) + WARN_ON(ret != PMD_SIZE); + else + WARN_ON(ret != PAGE_SIZE); + + wb_cache_pmem(addr, ret); + + spin_lock_irq(&mapping->tree_lock); + radix_tree_delete(&mapping->page_tree, block); + spin_unlock_irq(&mapping->tree_lock); + mapping->nrdax--; + + return pgoff_mkclean(block, mapping); +} + +/* + * flush the mapping to the persistent domain within the byte range of (start, + * end). This is required by data integrity operations to ensure file data is on + * persistent storage prior to completion of the operation. It also requires us + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when + * the file is written to again so wehave an indication that we need to flush + * the mapping if a data integrity operation takes place. + * + * We don't need commits to storage here - the filesystems will issue flushes + * appropriately at the conclusion of the data integrity operation via REQ_FUA + * writes or blkdev_issue_flush() commands. This requires the DAX block device + * to implement persistent storage domain fencing/commits on receiving a + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher + * layers. + */ +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end) +{ + struct inode *inode = mapping->host; + pgoff_t indices[PAGEVEC_SIZE]; + struct pagevec pvec; + int i, error; + + pgoff_t start_page = start >> PAGE_CACHE_SHIFT; + pgoff_t end_page = end >> PAGE_CACHE_SHIFT; + + + if (mapping->nrdax == 0) + return 0; + + if (!inode->i_op->get_block) { + WARN_ONCE(1, "Flushing DAX mapping without get_block()!"); + mapping->nrdax = 0; + return 0; + } + + BUG_ON(inode->i_blkbits != PAGE_SHIFT); + + tag_pages_for_writeback(mapping, start_page, end_page); + + pagevec_init(&pvec, 0); + while (1) { + pvec.nr = find_get_entries_tag(mapping, start_page, + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, + pvec.pages, indices); + + if (pvec.nr == 0) + break; + + for (i = 0; i < pvec.nr; i++) { + error = dax_flush_one_mapping(mapping, inode, + indices[i], pvec.pages[i]); + if (error) + return error; + } + } + + return 0; +} diff --git a/include/linux/dax.h b/include/linux/dax.h index e9d57f68..5eff476 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -41,4 +41,5 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end); #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bbac913..1b3df56 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, spinlock_t *ptl; ptl = pmd_lock(mm, pmd); - if (pmd_none(*pmd)) { - entry = pmd_mkhuge(pfn_pmd(pfn, prot)); - if (write) { - entry = pmd_mkyoung(pmd_mkdirty(entry)); - entry = maybe_pmd_mkwrite(entry, vma); - } - set_pmd_at(mm, addr, pmd, entry); - update_mmu_cache_pmd(vma, addr, pmd); + entry = pmd_mkhuge(pfn_pmd(pfn, prot)); + if (write) { + entry = pmd_mkyoung(pmd_mkdirty(entry)); + entry = maybe_pmd_mkwrite(entry, vma); } + set_pmd_at(mm, addr, pmd, entry); + update_mmu_cache_pmd(vma, addr, pmd); spin_unlock(ptl); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2c90357..1801df8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include "internal.h" @@ -2338,6 +2339,14 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) if (wbc->nr_to_write <= 0) return 0; + + if (wbc->sync_mode == WB_SYNC_ALL && dax_mapping(mapping)) { + ret = dax_flush_mapping(mapping, wbc->range_start, + wbc->range_end); + if (ret) + return ret; + } + if (mapping->a_ops->writepages) ret = mapping->a_ops->writepages(mapping, wbc); else -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id D9FCB7F3F for ; Thu, 29 Oct 2015 15:12:40 -0500 (CDT) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id C8AF78F804C for ; Thu, 29 Oct 2015 13:12:40 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id Bl6PBMSt0hytUACb for ; Thu, 29 Oct 2015 13:12:38 -0700 (PDT) From: Ross Zwisler Subject: [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault Date: Thu, 29 Oct 2015 14:12:14 -0600 Message-Id: <1446149535-16200-11-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Previously the functionality offered by dax_pfn_mkwrite() was open-coded in XFS and ext2. With the addition of DAX msync/fsync support we need to call dax_pfn_mkwrite() so that the newly writeable page can be added to the radix tree and marked as dirty. Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 4 +++- fs/xfs/xfs_file.c | 9 +++++---- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index fc1418c..5741e29 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct ext2_inode_info *ei = EXT2_I(inode); - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(inode->i_sb); file_update_time(vma->vm_file); @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&ei->dax_sem); sb_end_pagefault(inode->i_sb); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f429662..584e4eb 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1575,9 +1575,8 @@ xfs_filemap_pmd_fault( /* * pfn_mkwrite was originally inteneded to ensure we capture time stamp * updates on write faults. In reality, it's need to serialise against - * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite() - * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault - * barrier in place. + * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED + * to ensure we serialise the fault barrier in place. */ static int xfs_filemap_pfn_mkwrite( @@ -1587,7 +1586,7 @@ xfs_filemap_pfn_mkwrite( struct inode *inode = file_inode(vma->vm_file); struct xfs_inode *ip = XFS_I(inode); - int ret = VM_FAULT_NOPAGE; + int ret; loff_t size; trace_xfs_filemap_pfn_mkwrite(ip); @@ -1600,6 +1599,8 @@ xfs_filemap_pfn_mkwrite( size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); sb_end_pagefault(inode->i_sb); return ret; -- 2.1.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id D873E7F47 for ; Thu, 29 Oct 2015 17:50:07 -0500 (CDT) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id CA19F8F8033 for ; Thu, 29 Oct 2015 15:50:04 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by cuda.sgi.com with ESMTP id 0UmV8VoR3jwsgGWx for ; Thu, 29 Oct 2015 15:50:00 -0700 (PDT) Date: Thu, 29 Oct 2015 16:49:53 -0600 From: Ross Zwisler Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151029224953.GA17933@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , Dave Chinner Cc: x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Dan Williams , linux-ext4@vger.kernel.org, Jeff Layton , Matthew Wilcox On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html Hmm...I think I may need to isolate the fsync/msync flushing against races with truncate since we are calling into the filesystem directly with get_block(). Dave (Chinner), does this sound right? Also, one thing I forgot to mention is that these patches are built upon the first version of Dave Chinner's XFS patches and my ext2 patches that deal with the truncate races with DAX. A snapshot of my development tree with these patches applied can be found here: https://github.com/01org/prd/tree/fsync_rfc _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id A222C7F3F for ; Thu, 29 Oct 2015 22:55:42 -0500 (CDT) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 8FD6330406B for ; Thu, 29 Oct 2015 20:55:39 -0700 (PDT) Received: from ipmail05.adl6.internode.on.net (ipmail05.adl6.internode.on.net [150.101.137.143]) by cuda.sgi.com with ESMTP id ZsQ2KfU35opPERhE for ; Thu, 29 Oct 2015 20:55:37 -0700 (PDT) Date: Fri, 30 Oct 2015 14:55:33 +1100 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151030035533.GU19199@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Dan Williams , linux-ext4@vger.kernel.org, Jeff Layton , Matthew Wilcox On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. We need the "big hammer" regardless of fsync. If REQ_FLUSH and REQ_FUA don't do the right thing when it comes to ordering journal writes against other IO operations, then the filesystems are not crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding changes back to stable storage, just like they do for existing storage.... > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > Ross Zwisler (11): > pmem: add wb_cache_pmem() to the PMEM API > mm: add pmd_mkclean() > pmem: enable REQ_FLUSH handling > dax: support dirty DAX entries in radix tree > mm: add follow_pte_pmd() > mm: add pgoff_mkclean() > mm: add find_get_entries_tag() > fs: add get_block() to struct inode_operations I don't think this is the right thing to do - it propagates the use of bufferheads as a mapping structure into places where we do not want bufferheads. We've recently added a similar block mapping interface to the export operations structure for PNFS and that uses a "struct iomap" which is far more suited to being an inode operation this. We have plans to move this to the inode operations for various reasons. e.g: multipage write, adding interfaces that support proper mapping of holes, etc: https://www.redhat.com/archives/cluster-devel/2014-October/msg00167.html So after many years of saying no to moving getblocks to the inode operations it seems like the wrong thing to do now considering I want to convert all the DAX code to use iomaps while only 2/3 filesystems are supported... > dax: add support for fsync/sync Why put the dax_flush_mapping() in do_writepages()? Why not call it directly from the filesystem ->fsync() implementations where a getblocks callback could also be provided? Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id E29D87F4E for ; Fri, 30 Oct 2015 13:34:10 -0500 (CDT) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id 5870DAC002 for ; Fri, 30 Oct 2015 11:34:10 -0700 (PDT) Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by cuda.sgi.com with ESMTP id bwF82aAV6GWh0UUQ (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Fri, 30 Oct 2015 11:34:08 -0700 (PDT) Received: by wijp11 with SMTP id p11so16826932wij.0 for ; Fri, 30 Oct 2015 11:34:07 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Date: Fri, 30 Oct 2015 11:34:07 -0700 Message-ID: Subject: Re: [RFC 00/11] DAX fsynx/msync support From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: X86 ML , Theodore Ts'o , Andrew Morton , "linux-nvdimm@lists.01.org" , Jan Kara , "linux-kernel@vger.kernel.org" , xfs@oss.sgi.com, "J. Bruce Fields" , Linux MM , Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel , Matthew Wilcox , Jeff Layton , linux-ext4@vger.kernel.org, Thomas Gleixner , Matthew Wilcox On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > Ross Zwisler (11): > pmem: add wb_cache_pmem() to the PMEM API > mm: add pmd_mkclean() > pmem: enable REQ_FLUSH handling > dax: support dirty DAX entries in radix tree > mm: add follow_pte_pmd() > mm: add pgoff_mkclean() > mm: add find_get_entries_tag() > fs: add get_block() to struct inode_operations > dax: add support for fsync/sync > xfs, ext2: call dax_pfn_mkwrite() on write fault > ext4: add ext4_dax_pfn_mkwrite() This is great to have when the flush-the-world solution ends up killing performance. However, there are a couple mitigating options for workloads that dirty small amounts and flush often that we need to collect data on: 1/ Using cache management and pcommit from userspace to skip calls to msync / fsync. Although, this does not eliminate all calls to blkdev_issue_flush as the fs may invoke it for other reasons. I suspect turning on REQ_FUA support eliminates a number of those invocations, and pmem already satisfies REQ_FUA semantics by default. 2/ Turn off DAX and use the page cache. As Dave mentions [1] we should enable this control on a per-inode basis. I'm folding in this capability as a blkdev_ioctl for the next version of the raw block DAX support patch. It's entirely possible these mitigations won't eliminate the need for a mechanism like this, but I think we have a bit more work to do to find out how bad this is in practice as well as the crossover point where walking the radix becomes prohibitive. We also have the option of tracking open DAX extents in the driver. Even at coarse granularities I'd be surprised if we can't mitigate most of the overhead. [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002598.html _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id A2B257F4E for ; Fri, 30 Oct 2015 13:39:45 -0500 (CDT) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay1.corp.sgi.com (Postfix) with ESMTP id 6EC318F8064 for ; Fri, 30 Oct 2015 11:39:42 -0700 (PDT) Received: from mga14.intel.com ([192.55.52.115]) by cuda.sgi.com with ESMTP id O6TevWMNCAyp1tcY for ; Fri, 30 Oct 2015 11:39:40 -0700 (PDT) Date: Fri, 30 Oct 2015 12:39:38 -0600 From: Ross Zwisler Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151030183938.GC24643@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20151030035533.GU19199@dastard> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@lists.01.org, Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote: > On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > > This patch series adds support for fsync/msync to DAX. > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > filesystem changes that are needed after the DAX code is added, but these > > patches may change slightly as the filesystem fault handling for DAX is > > being modified ([1] and [2]). > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > get this out there so people would see the direction I was going and > > hopefully comment on any big red flags sooner rather than later. > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > but I think that if we can get this reviewed and working it's a much better > > solution than the "big hammer" approach that blindly flushes entire PMEM > > namespaces [3]. > > We need the "big hammer" regardless of fsync. If REQ_FLUSH and > REQ_FUA don't do the right thing when it comes to ordering journal > writes against other IO operations, then the filesystems are not > crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding > changes back to stable storage, just like they do for existing > storage.... I think that what I've got here (when it's fully working) will protect all the cases that we need. AFAIK there are three ways that data can be written to a PMEM namespace: 1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or pmem_rw_bytes(). All of these paths sync the newly written data durably to media before the I/O completes so they shouldn't have any reliance on REQ_FUA/REQ_FLUSH. 2) Through the DAX I/O path, dax_io(). As with PMEM we flush the newly written data durably to media before the I/O operation completes, so this path shouldn't have any reliance on REQ_FUA/REQ_FLUSH. 3) Through mmaps set up by DAX. This is the path we are trying to protect with the dirty page tracking and flushing in this patch set, and I think that this is the only path that has reliance on REQ_FLUSH. The goal of this set is to have the cache writeback all happen as part of the fsync/msync handling, and then have the REQ_FLUSH just provide the trailing wmb_pmem(). My guess is that XFS metadata writes happen via path 1), down through the PMEM driver. Am I missing anything, or should we be good to go? > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > > > Ross Zwisler (11): > > pmem: add wb_cache_pmem() to the PMEM API > > mm: add pmd_mkclean() > > pmem: enable REQ_FLUSH handling > > dax: support dirty DAX entries in radix tree > > mm: add follow_pte_pmd() > > mm: add pgoff_mkclean() > > mm: add find_get_entries_tag() > > fs: add get_block() to struct inode_operations > > I don't think this is the right thing to do - it propagates the use > of bufferheads as a mapping structure into places where we do not > want bufferheads. We've recently added a similar block mapping > interface to the export operations structure for PNFS and that uses > a "struct iomap" which is far more suited to being an inode > operation this. > > We have plans to move this to the inode operations for various > reasons. e.g: multipage write, adding interfaces that support proper > mapping of holes, etc: > > https://www.redhat.com/archives/cluster-devel/2014-October/msg00167.html > > So after many years of saying no to moving getblocks to the inode > operations it seems like the wrong thing to do now considering I > want to convert all the DAX code to use iomaps while only 2/3 > filesystems are supported... Okay, I'll take a look at this interface. I also think that we may need to flow through the filesystem before going into the DAX code so that we can serialize our flushing with respect to extent manipulation, as we had to do with our DAX fault paths. > > dax: add support for fsync/sync > > Why put the dax_flush_mapping() in do_writepages()? Why not call it > directly from the filesystem ->fsync() implementations where a > getblocks callback could also be provided? Because that's where you put it in your example. :) https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html Moving it into the filesystem where we know about get_block() is probably the right thing to do - I'll check it out. Thanks! _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 573D17F4E for ; Fri, 30 Oct 2015 14:43:07 -0500 (CDT) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 45AD4304066 for ; Fri, 30 Oct 2015 12:43:04 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id eQQwBjRqrww73WW4 for ; Fri, 30 Oct 2015 12:43:02 -0700 (PDT) Date: Fri, 30 Oct 2015 13:43:00 -0600 From: Ross Zwisler Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151030194300.GA22670@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: X86 ML , Theodore Ts'o , Andrew Morton , Thomas Gleixner , "linux-nvdimm@lists.01.org" , Jan Kara , "linux-kernel@vger.kernel.org" , xfs@oss.sgi.com, "J. Bruce Fields" , Linux MM , Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Jeff Layton , Matthew Wilcox On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: > On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler > wrote: > > This patch series adds support for fsync/msync to DAX. > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > filesystem changes that are needed after the DAX code is added, but these > > patches may change slightly as the filesystem fault handling for DAX is > > being modified ([1] and [2]). > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > get this out there so people would see the direction I was going and > > hopefully comment on any big red flags sooner rather than later. > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > but I think that if we can get this reviewed and working it's a much better > > solution than the "big hammer" approach that blindly flushes entire PMEM > > namespaces [3]. > > > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > > > Ross Zwisler (11): > > pmem: add wb_cache_pmem() to the PMEM API > > mm: add pmd_mkclean() > > pmem: enable REQ_FLUSH handling > > dax: support dirty DAX entries in radix tree > > mm: add follow_pte_pmd() > > mm: add pgoff_mkclean() > > mm: add find_get_entries_tag() > > fs: add get_block() to struct inode_operations > > dax: add support for fsync/sync > > xfs, ext2: call dax_pfn_mkwrite() on write fault > > ext4: add ext4_dax_pfn_mkwrite() > > This is great to have when the flush-the-world solution ends up > killing performance. However, there are a couple mitigating options > for workloads that dirty small amounts and flush often that we need to > collect data on: > > 1/ Using cache management and pcommit from userspace to skip calls to > msync / fsync. Although, this does not eliminate all calls to > blkdev_issue_flush as the fs may invoke it for other reasons. I > suspect turning on REQ_FUA support eliminates a number of those > invocations, and pmem already satisfies REQ_FUA semantics by default. Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM already handles the requirements of REQ_FUA, but I didn't realize that it might reduce the number of REQ_FLUSH bios we receive. > 2/ Turn off DAX and use the page cache. As Dave mentions [1] we > should enable this control on a per-inode basis. I'm folding in this > capability as a blkdev_ioctl for the next version of the raw block DAX > support patch. Umm...I think you just said "the way to avoid this delay is to just not use DAX". :) I don't think this is where we want to go - we are trying to make DAX better, not abandon it. > It's entirely possible these mitigations won't eliminate the need for > a mechanism like this, but I think we have a bit more work to do to > find out how bad this is in practice as well as the crossover point > where walking the radix becomes prohibitive. I'm guessing a single run through xfstests will be enough to convince you that the "big hammer" approach is untenable. Tests that used to take a second now take several minutes, at least in my VM testing environment... And that's only using a tiny 4GiB namespace. Yes, we can distribute the cost over multiple CPUs, but that just distributes the problem and doesn't reduce the overall work that needs to be done. Ultimately I think that looping through multiple GiB or even TiB of cache lines and blindly writing them back individually on every REQ_FLUSH is going to be a deal breaker. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 7FCEB7F59 for ; Sun, 1 Nov 2015 17:30:10 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 73186304062 for ; Sun, 1 Nov 2015 15:30:07 -0800 (PST) Received: from ipmail07.adl2.internode.on.net (ipmail07.adl2.internode.on.net [150.101.137.131]) by cuda.sgi.com with ESMTP id 0qVvIG1CR0LfUR7k for ; Sun, 01 Nov 2015 15:30:03 -0800 (PST) Date: Mon, 2 Nov 2015 10:29:48 +1100 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151101232948.GF10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20151030183938.GC24643@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox On Fri, Oct 30, 2015 at 12:39:38PM -0600, Ross Zwisler wrote: > On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote: > > On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > > > This patch series adds support for fsync/msync to DAX. > > > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > > filesystem changes that are needed after the DAX code is added, but these > > > patches may change slightly as the filesystem fault handling for DAX is > > > being modified ([1] and [2]). > > > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > > get this out there so people would see the direction I was going and > > > hopefully comment on any big red flags sooner rather than later. > > > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > > but I think that if we can get this reviewed and working it's a much better > > > solution than the "big hammer" approach that blindly flushes entire PMEM > > > namespaces [3]. > > > > We need the "big hammer" regardless of fsync. If REQ_FLUSH and > > REQ_FUA don't do the right thing when it comes to ordering journal > > writes against other IO operations, then the filesystems are not > > crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding > > changes back to stable storage, just like they do for existing > > storage.... > > I think that what I've got here (when it's fully working) will protect all the > cases that we need. > > AFAIK there are three ways that data can be written to a PMEM namespace: > > 1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or > pmem_rw_bytes(). All of these paths sync the newly written data durably to > media before the I/O completes so they shouldn't have any reliance on > REQ_FUA/REQ_FLUSH. I suspect that not all future pmem devices will use this driver/interface/semantics. Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable storage" commands. They are also IO barriers that affect scheduling of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA IO cannot be dispatched before all prior IO has been dispatched and drained from the request queue, and IO submitted after a queued REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued REQ_FLUSH/REQ_FUA operation. IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable storage, they also guarantee the order of IO dispatch and completion when concurrent IO is in progress. > 2) Through the DAX I/O path, dax_io(). As with PMEM we flush the newly > written data durably to media before the I/O operation completes, so this path > shouldn't have any reliance on REQ_FUA/REQ_FLUSH. That's fine, but that's not the problem we need solved ;) > 3) Through mmaps set up by DAX. This is the path we are trying to protect > with the dirty page tracking and flushing in this patch set, and I think that > this is the only path that has reliance on REQ_FLUSH. Quite possibly this is the case for the current intel pmem driver, but I don't look at the functionality from that perspective. Dirty page tracking is needed to enable "data writeback", whether it be CPU cachelines via pcommit() or dirty pages via submit_bio(). How the pages get dirty is irrelevant - the fact is they are dirty and we need to do /something/ to ensure they are correctly written back to the storage layer. REQ_FLUSH is needed to guarantee all data that has been written back to the storage layer is persistent in that layer. How a /driver/ manages that is up to the driver - the actual implementation is irrelevant to the higher layers. i.e. what we are concerned about at the filesystem level is that: a) "data writeback" is started correctly; b) the "data writeback" is completed; and c) volatile caches are completely flushed before we write the metadata changes that reference that data to the journal via FUA e.g. we could have pmem, but we are using buffered IO (i.e. non-DAX) and a hardware driver that doesn't flush CPU cachelines in the physical IO path. This requires that driver to flush CPU cachelines and place memory barriers in REQ_FLUSH operations, as well as after writing the data in REQ_FUA operations. Yes, this is different to the way the intel pmem drivers work (i.e. as noted in 1) above), but it is /not wrong/ as long as REQ_FLUSH/REQ_FUA also flush dirty cpu cachelines. IOWs, the high level code we write that implements fsync for DAX needs to be generic enough so that when something slightly different comes along we don't have to throw everything away and start again. I think your code will end up being generic enough to handle this, but let's make sure we don't implement something that can only work with pmem hardware/drivers that do all IO as fully synchronous to the stable domain... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id A608E7F4E for ; Fri, 30 Oct 2015 14:51:49 -0500 (CDT) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id 94A248F8035 for ; Fri, 30 Oct 2015 12:51:46 -0700 (PDT) Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by cuda.sgi.com with ESMTP id 9dGRdfF9XARKHPKa (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Fri, 30 Oct 2015 12:51:41 -0700 (PDT) Received: by wicfv8 with SMTP id fv8so17566136wic.0 for ; Fri, 30 Oct 2015 12:51:40 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20151030194300.GA22670@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030194300.GA22670@linux.intel.com> Date: Fri, 30 Oct 2015 12:51:40 -0700 Message-ID: Subject: Re: [RFC 00/11] DAX fsynx/msync support From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler wrote: > On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: >> On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler >> wrote: >> > This patch series adds support for fsync/msync to DAX. >> > >> > Patches 1 through 8 add various utilities that the DAX code will eventually >> > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are >> > filesystem changes that are needed after the DAX code is added, but these >> > patches may change slightly as the filesystem fault handling for DAX is >> > being modified ([1] and [2]). >> > >> > I've marked this series as RFC because I'm still testing, but I wanted to >> > get this out there so people would see the direction I was going and >> > hopefully comment on any big red flags sooner rather than later. >> > >> > I realize that we are getting pretty dang close to the v4.4 merge window, >> > but I think that if we can get this reviewed and working it's a much better >> > solution than the "big hammer" approach that blindly flushes entire PMEM >> > namespaces [3]. >> > >> > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html >> > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 >> > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html >> > >> > Ross Zwisler (11): >> > pmem: add wb_cache_pmem() to the PMEM API >> > mm: add pmd_mkclean() >> > pmem: enable REQ_FLUSH handling >> > dax: support dirty DAX entries in radix tree >> > mm: add follow_pte_pmd() >> > mm: add pgoff_mkclean() >> > mm: add find_get_entries_tag() >> > fs: add get_block() to struct inode_operations >> > dax: add support for fsync/sync >> > xfs, ext2: call dax_pfn_mkwrite() on write fault >> > ext4: add ext4_dax_pfn_mkwrite() >> >> This is great to have when the flush-the-world solution ends up >> killing performance. However, there are a couple mitigating options >> for workloads that dirty small amounts and flush often that we need to >> collect data on: >> >> 1/ Using cache management and pcommit from userspace to skip calls to >> msync / fsync. Although, this does not eliminate all calls to >> blkdev_issue_flush as the fs may invoke it for other reasons. I >> suspect turning on REQ_FUA support eliminates a number of those >> invocations, and pmem already satisfies REQ_FUA semantics by default. > > Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM > already handles the requirements of REQ_FUA, but I didn't realize that it > might reduce the number of REQ_FLUSH bios we receive. I'll let Dave chime in, but a lot of the flush requirements come from guaranteeing the state of the metadata, if metadata updates can be done with REQ_FUA then there is no subsequent need to flush. >> 2/ Turn off DAX and use the page cache. As Dave mentions [1] we >> should enable this control on a per-inode basis. I'm folding in this >> capability as a blkdev_ioctl for the next version of the raw block DAX >> support patch. > > Umm...I think you just said "the way to avoid this delay is to just not use > DAX". :) I don't think this is where we want to go - we are trying to make > DAX better, not abandon it. That's a bit of an exaggeration. Avoiding DAX where it is not necessary is not "abandoning DAX", it's using the right tool for the job. Page cache is fine for many cases. >> It's entirely possible these mitigations won't eliminate the need for >> a mechanism like this, but I think we have a bit more work to do to >> find out how bad this is in practice as well as the crossover point >> where walking the radix becomes prohibitive. > > I'm guessing a single run through xfstests will be enough to convince you that > the "big hammer" approach is untenable. Tests that used to take a second now > take several minutes, at least in my VM testing environment... And that's > only using a tiny 4GiB namespace. > > Yes, we can distribute the cost over multiple CPUs, but that just distributes > the problem and doesn't reduce the overall work that needs to be done. > Ultimately I think that looping through multiple GiB or even TiB of cache > lines and blindly writing them back individually on every REQ_FLUSH is going > to be a deal breaker. Right, part of the problem is that the driver doesn't know which blocks are actively DAX mapped. I think we can incrementally fix that without requiring DAX specific fsync/msync handling code for each fs that supports DAX. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 8D8277F59 for ; Sun, 1 Nov 2015 17:36:53 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id 7E8788F8033 for ; Sun, 1 Nov 2015 15:36:50 -0800 (PST) Received: from ipmail07.adl2.internode.on.net (ipmail07.adl2.internode.on.net [150.101.137.131]) by cuda.sgi.com with ESMTP id XouhuFOGSgIqb5ey for ; Sun, 01 Nov 2015 15:36:47 -0800 (PST) Date: Mon, 2 Nov 2015 10:36:32 +1100 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151101233632.GG10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030194300.GA22670@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: X86 ML , Theodore Ts'o , Andrew Morton , Thomas Gleixner , "linux-nvdimm@lists.01.org" , Jan Kara , "linux-kernel@vger.kernel.org" , xfs@oss.sgi.com, "J. Bruce Fields" , Linux MM , Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Jeff Layton , Matthew Wilcox On Fri, Oct 30, 2015 at 12:51:40PM -0700, Dan Williams wrote: > On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler > wrote: > > On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: > >> This is great to have when the flush-the-world solution ends up > >> killing performance. However, there are a couple mitigating options > >> for workloads that dirty small amounts and flush often that we need to > >> collect data on: > >> > >> 1/ Using cache management and pcommit from userspace to skip calls to > >> msync / fsync. Although, this does not eliminate all calls to > >> blkdev_issue_flush as the fs may invoke it for other reasons. I > >> suspect turning on REQ_FUA support eliminates a number of those > >> invocations, and pmem already satisfies REQ_FUA semantics by default. > > > > Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM > > already handles the requirements of REQ_FUA, but I didn't realize that it > > might reduce the number of REQ_FLUSH bios we receive. > > I'll let Dave chime in, but a lot of the flush requirements come from > guaranteeing the state of the metadata, if metadata updates can be > done with REQ_FUA then there is no subsequent need to flush. No need for cache flushes in this case, but we still need the IO scheduler to order such operations correctly. > >> 2/ Turn off DAX and use the page cache. As Dave mentions [1] we > >> should enable this control on a per-inode basis. I'm folding in this > >> capability as a blkdev_ioctl for the next version of the raw block DAX > >> support patch. > > > > Umm...I think you just said "the way to avoid this delay is to just not use > > DAX". :) I don't think this is where we want to go - we are trying to make > > DAX better, not abandon it. > > That's a bit of an exaggeration. Avoiding DAX where it is not > necessary is not "abandoning DAX", it's using the right tool for the > job. Page cache is fine for many cases. Think btrfs - any file that uses COW can't use DAX for write. Everything has to be buffered, unless the nodatacow flag is set, and then DAX can be used. Indeed, on ext4 if you are using file encryption you can't use DAX. IOWs, we already know that we have to support mixed DAX/non-DAX access within the same filesystem, so I'm with Dan here... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 6F47829DF5 for ; Mon, 2 Nov 2015 08:22:23 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id 4D28F8F8033 for ; Mon, 2 Nov 2015 06:22:23 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id nDSXQs6lHv8cxaUK (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Mon, 02 Nov 2015 06:22:22 -0800 (PST) From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> Date: Mon, 02 Nov 2015 09:22:15 -0500 In-Reply-To: <20151101232948.GF10656@dastard> (Dave Chinner's message of "Mon, 2 Nov 2015 10:29:48 +1100") Message-ID: MIME-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@ml01.01.org, Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox Dave Chinner writes: > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > storage" commands. They are also IO barriers that affect scheduling > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > IO cannot be dispatched before all prior IO has been dispatched and > drained from the request queue, and IO submitted after a queued > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > REQ_FLUSH/REQ_FUA operation. > > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > storage, they also guarantee the order of IO dispatch and > completion when concurrent IO is in progress. This hasn't been the case for several years, now. It used to work that way, and that was deemed a big performance problem. Since file systems already issued and waited for all I/O before sending down a barrier, we decided to get rid of the I/O ordering pieces of barriers (and stop calling them barriers). See commit 28e7d184521 (block: drop barrier ordering by queue draining). Cheers, Jeff _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 9245029DF5 for ; Mon, 2 Nov 2015 14:10:50 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id 73C538F8035 for ; Mon, 2 Nov 2015 12:10:47 -0800 (PST) Received: from ipmail07.adl2.internode.on.net (ipmail07.adl2.internode.on.net [150.101.137.131]) by cuda.sgi.com with ESMTP id aFzwLt9VIARhNHd8 for ; Mon, 02 Nov 2015 12:10:44 -0800 (PST) Date: Tue, 3 Nov 2015 07:10:29 +1100 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151102201029.GI10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Jeff Moyer Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@ml01.01.org, Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: > Dave Chinner writes: > > > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > > storage" commands. They are also IO barriers that affect scheduling > > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > > IO cannot be dispatched before all prior IO has been dispatched and > > drained from the request queue, and IO submitted after a queued > > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > > REQ_FLUSH/REQ_FUA operation. > > > > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > > storage, they also guarantee the order of IO dispatch and > > completion when concurrent IO is in progress. > > This hasn't been the case for several years, now. It used to work that > way, and that was deemed a big performance problem. Since file systems > already issued and waited for all I/O before sending down a barrier, we > decided to get rid of the I/O ordering pieces of barriers (and stop > calling them barriers). > > See commit 28e7d184521 (block: drop barrier ordering by queue draining). Yes, I realise that, even if I wasn't very clear about how I wrote it. ;) Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO scheduler calls to elv_dispatch_sort() that don't pass REQ_SOFTBARRIER in the queue. IOWs, if we queue a bunch of REQ_WRITE IOs followed by a REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured by the cache flush. Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO to complete, we know that all the earlier REQ_WRITE IOs are on stable storage, too. Hence there's no need for the elevator to drain the queue to guarantee completion ordering - the dispatch ordering and flush/fua write semantics guarantee that when the flush/fua completes, all the IOs dispatch prior to that flush/fua write are also on stable storage... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id DBADE29DF5 for ; Mon, 2 Nov 2015 15:02:56 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id CC0E0304053 for ; Mon, 2 Nov 2015 13:02:53 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id G2vloKt9rmjsrunv (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Mon, 02 Nov 2015 13:02:52 -0800 (PST) From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> Date: Mon, 02 Nov 2015 16:02:48 -0500 In-Reply-To: <20151102201029.GI10656@dastard> (Dave Chinner's message of "Tue, 3 Nov 2015 07:10:29 +1100") Message-ID: MIME-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: linux-nvdimm@ml01.01.org, "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , axboe@kernel.dk, Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox Dave Chinner writes: > On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: >> Dave Chinner writes: >> >> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable >> > storage" commands. They are also IO barriers that affect scheduling >> > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA >> > IO cannot be dispatched before all prior IO has been dispatched and >> > drained from the request queue, and IO submitted after a queued >> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued >> > REQ_FLUSH/REQ_FUA operation. >> > >> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable >> > storage, they also guarantee the order of IO dispatch and >> > completion when concurrent IO is in progress. >> >> This hasn't been the case for several years, now. It used to work that >> way, and that was deemed a big performance problem. Since file systems >> already issued and waited for all I/O before sending down a barrier, we >> decided to get rid of the I/O ordering pieces of barriers (and stop >> calling them barriers). >> >> See commit 28e7d184521 (block: drop barrier ordering by queue draining). > > Yes, I realise that, even if I wasn't very clear about how I wrote > it. ;) > > Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO > barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH > -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO > scheduler calls to elv_dispatch_sort() that don't pass > REQ_SOFTBARRIER in the queue. This part is right. > IOWs, if we queue a bunch of REQ_WRITE IOs followed by a > REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be > dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured > by the cache flush. But this part is not. It is up to the I/O scheduler to decide when to dispatch requests. It can hold on to them for a variety of reasons. Flush requests, however, do not go through the I/O scheduler. At the very moment that the flush request is inserted, it goes directly to the dispatch queue (assuming no other flush is in progress). The prior requests may still be waiting in the I/O scheduler's internal lists. So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH. However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the queue. > Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO > to complete, we know that all the earlier REQ_WRITE IOs are on > stable storage, too. Hence there's no need for the elevator to drain > the queue to guarantee completion ordering - the dispatch ordering > and flush/fua write semantics guarantee that when the flush/fua > completes, all the IOs dispatch prior to that flush/fua write are > also on stable storage... Des xfs rely on this model for correctness? If so, I'd say we've got a problem. Cheers, Jeff _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id E365A7FD7 for ; Wed, 4 Nov 2015 12:35:07 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id D1C3F30404E for ; Wed, 4 Nov 2015 10:35:07 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id dyPxwZo48v6dofSC (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Wed, 04 Nov 2015 10:35:04 -0800 (PST) From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> Date: Wed, 04 Nov 2015 13:34:58 -0500 In-Reply-To: (Jeff Moyer's message of "Mon, 02 Nov 2015 16:02:48 -0500") Message-ID: MIME-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: axboe@kernel.dk, Theodore Ts'o , Andrew Morton , linux-nvdimm@ml01.01.org, x86@kernel.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Jeff Layton , linux-ext4@vger.kernel.org, Thomas Gleixner , Jan Kara , Matthew Wilcox Jeff Moyer writes: >> Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO >> to complete, we know that all the earlier REQ_WRITE IOs are on >> stable storage, too. Hence there's no need for the elevator to drain >> the queue to guarantee completion ordering - the dispatch ordering >> and flush/fua write semantics guarantee that when the flush/fua >> completes, all the IOs dispatch prior to that flush/fua write are >> also on stable storage... > > Des xfs rely on this model for correctness? If so, I'd say we've got a > problem. Dave? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 47D477FB5 for ; Thu, 5 Nov 2015 02:33:20 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay1.corp.sgi.com (Postfix) with ESMTP id 176878F8049 for ; Thu, 5 Nov 2015 00:33:16 -0800 (PST) Received: from ipmail06.adl6.internode.on.net (ipmail06.adl6.internode.on.net [150.101.137.145]) by cuda.sgi.com with ESMTP id XHzaOWeJ3Q6QTXxq for ; Thu, 05 Nov 2015 00:33:13 -0800 (PST) Date: Thu, 5 Nov 2015 19:33:09 +1100 From: Dave Chinner Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151105083309.GJ19199@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Jeff Moyer Cc: linux-nvdimm@ml01.01.org, "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , axboe@kernel.dk, Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox [ sorry for slow response, been without an internet connection for ~36 hours ] On Mon, Nov 02, 2015 at 04:02:48PM -0500, Jeff Moyer wrote: > Dave Chinner writes: > > > On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: > >> Dave Chinner writes: > >> > >> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > >> > storage" commands. They are also IO barriers that affect scheduling > >> > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > >> > IO cannot be dispatched before all prior IO has been dispatched and > >> > drained from the request queue, and IO submitted after a queued > >> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > >> > REQ_FLUSH/REQ_FUA operation. > >> > > >> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > >> > storage, they also guarantee the order of IO dispatch and > >> > completion when concurrent IO is in progress. > >> > >> This hasn't been the case for several years, now. It used to work that > >> way, and that was deemed a big performance problem. Since file systems > >> already issued and waited for all I/O before sending down a barrier, we > >> decided to get rid of the I/O ordering pieces of barriers (and stop > >> calling them barriers). > >> > >> See commit 28e7d184521 (block: drop barrier ordering by queue draining). > > > > Yes, I realise that, even if I wasn't very clear about how I wrote > > it. ;) > > > > Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO > > barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH > > -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO > > scheduler calls to elv_dispatch_sort() that don't pass > > REQ_SOFTBARRIER in the queue. > > This part is right. > > > IOWs, if we queue a bunch of REQ_WRITE IOs followed by a > > REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be > > dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured > > by the cache flush. > > But this part is not. It is up to the I/O scheduler to decide when to > dispatch requests. It can hold on to them for a variety of reasons. > Flush requests, however, do not go through the I/O scheduler. At the That's pure REQ_FLUSH bios, right? Aren't data IOs with REQ_FLUSH|REQ_FUA sorted like any other IO? > very moment that the flush request is inserted, it goes directly to the > dispatch queue (assuming no other flush is in progress). The prior > requests may still be waiting in the I/O scheduler's internal lists. > > So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH. > However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the > queue. Uh, ok, that's different, and most definitely not the "IO barrier" I was under the impression REQ_FLUSH|REQ_FUA gave us. > > Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO > > to complete, we know that all the earlier REQ_WRITE IOs are on > > stable storage, too. Hence there's no need for the elevator to drain > > the queue to guarantee completion ordering - the dispatch ordering > > and flush/fua write semantics guarantee that when the flush/fua > > completes, all the IOs dispatch prior to that flush/fua write are > > also on stable storage... > > Des xfs rely on this model for correctness? If so, I'd say we've got a > problem No, it doesn't. The XFS integrity model doesn't trust the IO layers to tell the truth about IO ordering and completion or for it's developers to fully understand how IO layer ordering works. :P i.e. we wait for full completions of all dependent IO before issuing flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure the dependent IOs are fully caught by the cache flushes... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 80B967F8D for ; Thu, 5 Nov 2015 13:49:30 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay3.corp.sgi.com (Postfix) with ESMTP id CD9B0AC001 for ; Thu, 5 Nov 2015 11:49:29 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id 3QyEVUC55SHGagLy (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Thu, 05 Nov 2015 11:49:28 -0800 (PST) From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> <20151105083309.GJ19199@dastard> Date: Thu, 05 Nov 2015 14:49:21 -0500 In-Reply-To: <20151105083309.GJ19199@dastard> (Dave Chinner's message of "Thu, 5 Nov 2015 19:33:09 +1100") Message-ID: MIME-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: linux-nvdimm@ml01.01.org, "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , Dan Williams , x86@kernel.org, Ingo Molnar , Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , axboe@kernel.dk, Theodore Ts'o , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox Dave Chinner writes: >> But this part is not. It is up to the I/O scheduler to decide when to >> dispatch requests. It can hold on to them for a variety of reasons. >> Flush requests, however, do not go through the I/O scheduler. At the > > That's pure REQ_FLUSH bios, right? Aren't data IOs with > REQ_FLUSH|REQ_FUA sorted like any other IO? No, they also go through the flush machinery, and so short-circuit the I/O scheduler. >> Des xfs rely on this model for correctness? If so, I'd say we've got a >> problem > > No, it doesn't. The XFS integrity model doesn't trust the IO layers > to tell the truth about IO ordering and completion or for it's > developers to fully understand how IO layer ordering works. :P > > i.e. we wait for full completions of all dependent IO before issuing > flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure > the dependent IOs are fully caught by the cache flushes... OK, phew! ;-) -Jeff _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id E92E17F69 for ; Thu, 5 Nov 2015 14:54:26 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id 87CBFAC001 for ; Thu, 5 Nov 2015 12:54:23 -0800 (PST) Received: from mail-io0-f171.google.com (mail-io0-f171.google.com [209.85.223.171]) by cuda.sgi.com with ESMTP id 1XbSe6GxNEZRJLNk (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Thu, 05 Nov 2015 12:54:21 -0800 (PST) Received: by iodd200 with SMTP id d200so103536260iod.0 for ; Thu, 05 Nov 2015 12:54:21 -0800 (PST) Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> <20151105083309.GJ19199@dastard> From: Jens Axboe Message-ID: <563BC1FB.60004@kernel.dk> Date: Thu, 5 Nov 2015 13:54:19 -0700 MIME-Version: 1.0 In-Reply-To: <20151105083309.GJ19199@dastard> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner , Jeff Moyer Cc: Jeff Layton , x86@kernel.org, Theodore Ts'o , Andrew Morton , Thomas Gleixner , linux-nvdimm@ml01.01.org, Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, "J. Bruce Fields" , linux-mm@kvack.org, Ingo Molnar , Andreas Dilger , Alexander Viro , "H. Peter Anvin" , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams , Matthew Wilcox On 11/05/2015 01:33 AM, Dave Chinner wrote: >> Des xfs rely on this model for correctness? If so, I'd say we've got a >> problem > > No, it doesn't. The XFS integrity model doesn't trust the IO layers > to tell the truth about IO ordering and completion or for it's > developers to fully understand how IO layer ordering works. :P That's good, because the storage developers simplified the model so that fs developers would be able to get and use it. > i.e. we wait for full completions of all dependent IO before issuing > flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure > the dependent IOs are fully caught by the cache flushes... ... which is what you are supposed to do, that's how it works. -- Jens Axboe _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f49.google.com (mail-pa0-f49.google.com [209.85.220.49]) by kanga.kvack.org (Postfix) with ESMTP id 77D046B0038 for ; Mon, 2 Nov 2015 16:02:53 -0500 (EST) Received: by pacfv9 with SMTP id fv9so165756307pac.3 for ; Mon, 02 Nov 2015 13:02:53 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id zo6si37618971pbc.29.2015.11.02.13.02.52 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 02 Nov 2015 13:02:52 -0800 (PST) From: Jeff Moyer Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> Date: Mon, 02 Nov 2015 16:02:48 -0500 In-Reply-To: <20151102201029.GI10656@dastard> (Dave Chinner's message of "Tue, 3 Nov 2015 07:10:29 +1100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox , axboe@kernel.dk Dave Chinner writes: > On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: >> Dave Chinner writes: >> >> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable >> > storage" commands. They are also IO barriers that affect scheduling >> > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA >> > IO cannot be dispatched before all prior IO has been dispatched and >> > drained from the request queue, and IO submitted after a queued >> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued >> > REQ_FLUSH/REQ_FUA operation. >> > >> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable >> > storage, they also guarantee the order of IO dispatch and >> > completion when concurrent IO is in progress. >> >> This hasn't been the case for several years, now. It used to work that >> way, and that was deemed a big performance problem. Since file systems >> already issued and waited for all I/O before sending down a barrier, we >> decided to get rid of the I/O ordering pieces of barriers (and stop >> calling them barriers). >> >> See commit 28e7d184521 (block: drop barrier ordering by queue draining). > > Yes, I realise that, even if I wasn't very clear about how I wrote > it. ;) > > Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO > barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH > -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO > scheduler calls to elv_dispatch_sort() that don't pass > REQ_SOFTBARRIER in the queue. This part is right. > IOWs, if we queue a bunch of REQ_WRITE IOs followed by a > REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be > dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured > by the cache flush. But this part is not. It is up to the I/O scheduler to decide when to dispatch requests. It can hold on to them for a variety of reasons. Flush requests, however, do not go through the I/O scheduler. At the very moment that the flush request is inserted, it goes directly to the dispatch queue (assuming no other flush is in progress). The prior requests may still be waiting in the I/O scheduler's internal lists. So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH. However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the queue. > Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO > to complete, we know that all the earlier REQ_WRITE IOs are on > stable storage, too. Hence there's no need for the elevator to drain > the queue to guarantee completion ordering - the dispatch ordering > and flush/fua write semantics guarantee that when the flush/fua > completes, all the IOs dispatch prior to that flush/fua write are > also on stable storage... Des xfs rely on this model for correctness? If so, I'd say we've got a problem. Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756897AbbJ2UMb (ORCPT ); Thu, 29 Oct 2015 16:12:31 -0400 Received: from mga01.intel.com ([192.55.52.88]:30021 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750753AbbJ2UM2 (ORCPT ); Thu, 29 Oct 2015 16:12:28 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730552" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 00/11] DAX fsynx/msync support Date: Thu, 29 Oct 2015 14:12:04 -0600 Message-Id: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch series adds support for fsync/msync to DAX. Patches 1 through 8 add various utilities that the DAX code will eventually need, and the DAX code itself is added by patch 9. Patches 10 and 11 are filesystem changes that are needed after the DAX code is added, but these patches may change slightly as the filesystem fault handling for DAX is being modified ([1] and [2]). I've marked this series as RFC because I'm still testing, but I wanted to get this out there so people would see the direction I was going and hopefully comment on any big red flags sooner rather than later. I realize that we are getting pretty dang close to the v4.4 merge window, but I think that if we can get this reviewed and working it's a much better solution than the "big hammer" approach that blindly flushes entire PMEM namespaces [3]. [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html Ross Zwisler (11): pmem: add wb_cache_pmem() to the PMEM API mm: add pmd_mkclean() pmem: enable REQ_FLUSH handling dax: support dirty DAX entries in radix tree mm: add follow_pte_pmd() mm: add pgoff_mkclean() mm: add find_get_entries_tag() fs: add get_block() to struct inode_operations dax: add support for fsync/sync xfs, ext2: call dax_pfn_mkwrite() on write fault ext4: add ext4_dax_pfn_mkwrite() arch/x86/include/asm/pgtable.h | 5 ++ arch/x86/include/asm/pmem.h | 11 +-- drivers/nvdimm/pmem.c | 3 +- fs/dax.c | 161 +++++++++++++++++++++++++++++++++++++++-- fs/ext2/file.c | 5 +- fs/ext4/file.c | 23 +++++- fs/inode.c | 1 + fs/xfs/xfs_file.c | 9 ++- fs/xfs/xfs_iops.c | 1 + include/linux/dax.h | 6 ++ include/linux/fs.h | 5 +- include/linux/mm.h | 2 + include/linux/pagemap.h | 3 + include/linux/pmem.h | 22 +++++- include/linux/radix-tree.h | 3 + include/linux/rmap.h | 5 ++ mm/filemap.c | 73 ++++++++++++++++++- mm/huge_memory.c | 14 ++-- mm/memory.c | 41 +++++++++-- mm/page-writeback.c | 9 +++ mm/rmap.c | 53 ++++++++++++++ mm/truncate.c | 5 +- 22 files changed, 418 insertions(+), 42 deletions(-) -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757462AbbJ2UMd (ORCPT ); Thu, 29 Oct 2015 16:12:33 -0400 Received: from mga01.intel.com ([192.55.52.88]:30021 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750967AbbJ2UM3 (ORCPT ); Thu, 29 Oct 2015 16:12:29 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730559" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API Date: Thu, 29 Oct 2015 14:12:05 -0600 Message-Id: <1446149535-16200-2-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The function __arch_wb_cache_pmem() was already an internal implementation detail of the x86 PMEM API, but this functionality needs to be exported as part of the general PMEM API to handle the fsync/msync case for DAX mmaps. Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pmem.h | 11 ++++++----- include/linux/pmem.h | 22 +++++++++++++++++++++- 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h index d8ce3ec..6c7ade0 100644 --- a/arch/x86/include/asm/pmem.h +++ b/arch/x86/include/asm/pmem.h @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void) } /** - * __arch_wb_cache_pmem - write back a cache range with CLWB + * arch_wb_cache_pmem - write back a cache range with CLWB * @vaddr: virtual start address * @size: number of bytes to write back * * Write back a cache range using the CLWB (cache line write back) * instruction. This function requires explicit ordering with an - * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation. + * arch_wmb_pmem() call. */ -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size) +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) { u16 x86_clflush_size = boot_cpu_data.x86_clflush_size; unsigned long clflush_mask = x86_clflush_size - 1; + void *vaddr = (void __force *)addr; void *vend = vaddr + size; void *p; @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes, len = copy_from_iter_nocache(vaddr, bytes, i); if (__iter_needs_pmem_wb(i)) - __arch_wb_cache_pmem(vaddr, bytes); + arch_wb_cache_pmem(addr, bytes); return len; } @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) else memset(vaddr, 0, size); - __arch_wb_cache_pmem(vaddr, size); + arch_wb_cache_pmem(addr, size); } static inline bool __arch_has_wmb_pmem(void) diff --git a/include/linux/pmem.h b/include/linux/pmem.h index 85f810b3..2cd5003 100644 --- a/include/linux/pmem.h +++ b/include/linux/pmem.h @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size) { BUG(); } + +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size) +{ + BUG(); +} #endif /* * Architectures that define ARCH_HAS_PMEM_API must provide * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(), - * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem(). + * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem() + * and arch_has_wmb_pmem(). */ static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size) { @@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size) else default_clear_pmem(addr, size); } + +/** + * wb_cache_pmem - write back processor cache for PMEM memory range + * @addr: virtual start address + * @size: number of bytes to write back + * + * Write back the processor cache range starting at 'addr' for 'size' bytes. + * This function requires explicit ordering with a wmb_pmem() call. + */ +static inline void wb_cache_pmem(void __pmem *addr, size_t size) +{ + if (arch_has_pmem_api()) + arch_wb_cache_pmem(addr, size); +} #endif /* __PMEM_H__ */ -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757817AbbJ2UVS (ORCPT ); Thu, 29 Oct 2015 16:21:18 -0400 Received: from mga01.intel.com ([192.55.52.88]:30021 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752703AbbJ2UMa (ORCPT ); Thu, 29 Oct 2015 16:12:30 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730568" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 02/11] mm: add pmd_mkclean() Date: Thu, 29 Oct 2015 14:12:06 -0600 Message-Id: <1446149535-16200-3-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently PMD pages can be dirtied via pmd_mkdirty(), but cannot be cleaned. For DAX mmap dirty page tracking we need to be able to clean PMD pages when we flush them to media so that we get a new write fault the next time the are written to. Signed-off-by: Ross Zwisler --- arch/x86/include/asm/pgtable.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 867da5b..c548e4c 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -277,6 +277,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } +static inline pmd_t pmd_mkclean(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); +} + static inline pmd_t pmd_mkhuge(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_PSE); -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757532AbbJ2UMf (ORCPT ); Thu, 29 Oct 2015 16:12:35 -0400 Received: from mga01.intel.com ([192.55.52.88]:30021 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750753AbbJ2UMb (ORCPT ); Thu, 29 Oct 2015 16:12:31 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730578" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 03/11] pmem: enable REQ_FLUSH handling Date: Thu, 29 Oct 2015 14:12:07 -0600 Message-Id: <1446149535-16200-4-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently the PMEM driver doesn't accept REQ_FLUSH bios. These are sent down via blkdev_issue_flush() in response to a fsync() or msync(). When we get an msync() or fsync() it is the responsibility of the DAX code to flush all dirty pages to media. The PMEM driver then just has issue a wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all the flushed data has been durably stored on the media. Signed-off-by: Ross Zwisler --- drivers/nvdimm/pmem.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 0ba6a97..e1e222e 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -80,7 +80,7 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio) if (do_acct) nd_iostat_end(bio, start); - if (bio_data_dir(bio)) + if (bio_data_dir(bio) || (bio->bi_rw & REQ_FLUSH)) wmb_pmem(); bio_endio(bio); @@ -189,6 +189,7 @@ static int pmem_attach_disk(struct device *dev, blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE); blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX); blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); + blk_queue_flush(pmem->pmem_queue, REQ_FLUSH); queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue); disk = alloc_disk(0); -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964777AbbJ2UTM (ORCPT ); Thu, 29 Oct 2015 16:19:12 -0400 Received: from mga01.intel.com ([192.55.52.88]:30021 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757056AbbJ2UMc (ORCPT ); Thu, 29 Oct 2015 16:12:32 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730597" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 04/11] dax: support dirty DAX entries in radix tree Date: Thu, 29 Oct 2015 14:12:08 -0600 Message-Id: <1446149535-16200-5-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add support for tracking dirty DAX entries in the struct address_space radix tree. This tree is already used for dirty page writeback, and it already supports the use of exceptional (non struct page*) entries. In order to properly track dirty DAX pages we will insert new exceptional entries into the radix tree that represent dirty DAX PTE or PMD pages. There are currently two types of exceptional entries (shmem and shadow) that can be placed into the radix tree, and this adds a third. There shouldn't be any collisions between these various exceptional entries because only one type of exceptional entry should be able to be found in a radix tree at a time depending on how it is being used. Signed-off-by: Ross Zwisler --- fs/inode.c | 1 + include/linux/dax.h | 5 +++++ include/linux/fs.h | 1 + include/linux/radix-tree.h | 3 +++ mm/filemap.c | 12 ++++++++---- mm/truncate.c | 5 +++-- 6 files changed, 21 insertions(+), 6 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 78a17b8..f7c87a6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -496,6 +496,7 @@ void clear_inode(struct inode *inode) spin_lock_irq(&inode->i_data.tree_lock); BUG_ON(inode->i_data.nrpages); BUG_ON(inode->i_data.nrshadows); + BUG_ON(inode->i_data.nrdax); spin_unlock_irq(&inode->i_data.tree_lock); BUG_ON(!list_empty(&inode->i_data.private_list)); BUG_ON(!(inode->i_state & I_FREEING)); diff --git a/include/linux/dax.h b/include/linux/dax.h index b415e52..e9d57f68 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) { return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); } + +static inline bool dax_mapping(struct address_space *mapping) +{ + return mapping->host && IS_DAX(mapping->host); +} #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index 72d8a84..f791698 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -433,6 +433,7 @@ struct address_space { /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ unsigned long nrshadows; /* number of shadow entries */ + unsigned long nrdax; /* number of DAX entries */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ unsigned long flags; /* error bits/gfp mask */ diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 33170db..fabec66 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -51,6 +51,9 @@ #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 +#define RADIX_TREE_DAX_PTE ((void *)(0x10 | RADIX_TREE_EXCEPTIONAL_ENTRY)) +#define RADIX_TREE_DAX_PMD ((void *)(0x20 | RADIX_TREE_EXCEPTIONAL_ENTRY)) + static inline int radix_tree_is_indirect_ptr(void *ptr) { return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); diff --git a/mm/filemap.c b/mm/filemap.c index 327910c..c3a9e4f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -11,6 +11,7 @@ */ #include #include +#include #include #include #include @@ -440,7 +441,7 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (mapping->nrpages) { + if (mapping->nrpages || mapping->nrdax) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); /* See comment of filemap_write_and_wait() */ @@ -538,6 +539,9 @@ static int page_cache_tree_insert(struct address_space *mapping, p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); if (!radix_tree_exceptional_entry(p)) return -EEXIST; + + BUG_ON(dax_mapping(mapping)); + if (shadowp) *shadowp = p; mapping->nrshadows--; @@ -1201,9 +1205,9 @@ repeat: if (radix_tree_deref_retry(page)) goto restart; /* - * A shadow entry of a recently evicted page, - * or a swap entry from shmem/tmpfs. Return - * it without attempting to raise page count. + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. */ goto export; } diff --git a/mm/truncate.c b/mm/truncate.c index 76e35ad..cdf44a0 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -9,6 +9,7 @@ #include #include +#include #include #include #include @@ -29,8 +30,8 @@ static void clear_exceptional_entry(struct address_space *mapping, struct radix_tree_node *node; void **slot; - /* Handled by shmem itself */ - if (shmem_mapping(mapping)) + /* Handled by shmem or DAX directly */ + if (shmem_mapping(mapping) || dax_mapping(mapping)) return; spin_lock_irq(&mapping->tree_lock); -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757599AbbJ2UMh (ORCPT ); Thu, 29 Oct 2015 16:12:37 -0400 Received: from mga01.intel.com ([192.55.52.88]:25873 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757294AbbJ2UMd (ORCPT ); Thu, 29 Oct 2015 16:12:33 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730621" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 05/11] mm: add follow_pte_pmd() Date: Thu, 29 Oct 2015 14:12:09 -0600 Message-Id: <1446149535-16200-6-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a huge page PMD leaf to be found and returned. Signed-off-by: Ross Zwisler --- include/linux/mm.h | 2 ++ mm/memory.c | 41 +++++++++++++++++++++++++++++++++-------- 2 files changed, 35 insertions(+), 8 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 80001de..393441c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1166,6 +1166,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); +int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index deb679c..c2b8c0a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3512,8 +3512,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) } #endif /* __PAGETABLE_PMD_FOLDED */ -static int __follow_pte(struct mm_struct *mm, unsigned long address, - pte_t **ptepp, spinlock_t **ptlp) +static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { pgd_t *pgd; pud_t *pud; @@ -3529,12 +3529,23 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address, goto out; pmd = pmd_offset(pud, address); - VM_BUG_ON(pmd_trans_huge(*pmd)); - if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) - goto out; - /* We cannot handle huge page PFN maps. Luckily they don't exist. */ - if (pmd_huge(*pmd)) + if (pmd_huge(*pmd)) { + if (!pmdpp) + goto out; + + *ptlp = pmd_lock(mm, pmd); + if (pmd_huge(*pmd)) { + /* Success, we found a large PTE */ + *pmdpp = pmd; + return 0; + } + /* Somebody removed the PMD entry, try it as a pte */ + spin_unlock(*ptlp); + } + + /* FIXME: pmd_bad() is sometimes set for DAX pmds? */ + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) goto out; ptep = pte_offset_map_lock(mm, pmd, address, ptlp); @@ -3557,9 +3568,23 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte(mm, address, ptepp, ptlp))); + !(res = __follow_pte_pmd(mm, address, ptepp, NULL, + ptlp))); + return res; +} + +int follow_pte_pmd(struct mm_struct *mm, unsigned long address, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) +{ + int res; + + /* (void) is needed to make gcc happy */ + (void) __cond_lock(*ptlp, + !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp, + ptlp))); return res; } +EXPORT_SYMBOL(follow_pte_pmd); /** * follow_pfn - look up PFN at a user virtual address -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757846AbbJ2UPw (ORCPT ); Thu, 29 Oct 2015 16:15:52 -0400 Received: from mga01.intel.com ([192.55.52.88]:56848 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750967AbbJ2UMf (ORCPT ); Thu, 29 Oct 2015 16:12:35 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730634" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 06/11] mm: add pgoff_mkclean() Date: Thu, 29 Oct 2015 14:12:10 -0600 Message-Id: <1446149535-16200-7-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Introduce pgoff_mkclean() which conceptually is similar to page_mkclean() except it works in the absence of struct page and it can also be used to clean PMDs. This is needed for DAX's dirty page handling. Signed-off-by: Ross Zwisler --- include/linux/rmap.h | 5 +++++ mm/rmap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 58 insertions(+) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 29446ae..627875f9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -223,6 +223,11 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); int page_mkclean(struct page *); /* + * Cleans and write protects the PTEs of shared mappings. + */ +int pgoff_mkclean(pgoff_t, struct address_space *); + +/* * called in munlock()/munmap() path to check for other vmas holding * the page mlocked. */ diff --git a/mm/rmap.c b/mm/rmap.c index f5b5c1f..0ce16ab 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -586,6 +586,16 @@ vma_address(struct page *page, struct vm_area_struct *vma) return address; } +static inline unsigned long +pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma) +{ + unsigned long address; + + address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); + return address; +} + #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH static void percpu_flush_tlb_batch_pages(void *data) { @@ -1040,6 +1050,49 @@ int page_mkclean(struct page *page) } EXPORT_SYMBOL_GPL(page_mkclean); +int pgoff_mkclean(pgoff_t pgoff, struct address_space *mapping) +{ + struct vm_area_struct *vma; + int ret = 0; + + i_mmap_lock_read(mapping); + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { + struct mm_struct *mm = vma->vm_mm; + pmd_t pmd, *pmdp = NULL; + pte_t pte, *ptep = NULL; + unsigned long address; + spinlock_t *ptl; + + address = pgoff_address(pgoff, vma); + + ret = follow_pte_pmd(mm, address, &ptep, &pmdp, &ptl); + if (ret) + goto out; + + if (pmdp) { + flush_cache_page(vma, address, pmd_pfn(*pmdp)); + pmd = pmdp_huge_clear_flush(vma, address, pmdp); + pmd = pmd_wrprotect(pmd); + pmd = pmd_mkclean(pmd); + set_pmd_at(mm, address, pmdp, pmd); + spin_unlock(ptl); + } else { + BUG_ON(!ptep); + flush_cache_page(vma, address, pte_pfn(*ptep)); + pte = ptep_clear_flush(vma, address, ptep); + pte = pte_wrprotect(pte); + pte = pte_mkclean(pte); + set_pte_at(mm, address, ptep, pte); + pte_unmap_unlock(ptep, ptl); + } + } + + out: + i_mmap_unlock_read(mapping); + return ret; +} +EXPORT_SYMBOL_GPL(pgoff_mkclean); + /** * page_move_anon_rmap - move a page to our anon_vma * @page: the page to move to our anon_vma -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757790AbbJ2UPv (ORCPT ); Thu, 29 Oct 2015 16:15:51 -0400 Received: from mga01.intel.com ([192.55.52.88]:22552 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757531AbbJ2UMf (ORCPT ); Thu, 29 Oct 2015 16:12:35 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730643" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 07/11] mm: add find_get_entries_tag() Date: Thu, 29 Oct 2015 14:12:11 -0600 Message-Id: <1446149535-16200-8-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add find_get_entries_tag() to the family of functions that include find_get_entries(), find_get_pages() and find_get_pages_tag(). This is needed for DAX dirty page handling because we need a list of both page offsets and radix tree entries ('indices' and 'entries' in this function) that are marked with the PAGECACHE_TAG_TOWRITE tag. Signed-off-by: Ross Zwisler --- include/linux/pagemap.h | 3 +++ mm/filemap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 64 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index a6c78e0..6fea3be 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -354,6 +354,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages); unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, int tag, unsigned int nr_pages, struct page **pages); +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices); struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags); diff --git a/mm/filemap.c b/mm/filemap.c index c3a9e4f..992cf84 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1454,6 +1454,67 @@ repeat: } EXPORT_SYMBOL(find_get_pages_tag); +/** + * find_get_entries_tag - find and return entries that match @tag + * @mapping: the address_space to search + * @start: the starting page cache index + * @tag: the tag index + * @nr_entries: the maximum number of entries + * @entries: where the resulting entries are placed + * @indices: the cache indices corresponding to the entries in @entries + * + * Like find_get_entries, except we only return entries which are tagged with + * @tag. + */ +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start, + int tag, unsigned int nr_entries, + struct page **entries, pgoff_t *indices) +{ + void **slot; + unsigned int ret = 0; + struct radix_tree_iter iter; + + if (!nr_entries) + return 0; + + rcu_read_lock(); +restart: + radix_tree_for_each_tagged(slot, &mapping->page_tree, + &iter, start, tag) { + struct page *page; +repeat: + page = radix_tree_deref_slot(slot); + if (unlikely(!page)) + continue; + if (radix_tree_exception(page)) { + if (radix_tree_deref_retry(page)) + goto restart; + /* + * A shadow entry of a recently evicted page, a swap + * entry from shmem/tmpfs or a DAX entry. Return it + * without attempting to raise page count. + */ + goto export; + } + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? */ + if (unlikely(page != *slot)) { + page_cache_release(page); + goto repeat; + } +export: + indices[ret] = iter.index; + entries[ret] = page; + if (++ret == nr_entries) + break; + } + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(find_get_entries_tag); + /* * CD/DVDs are error prone. When a medium error occurs, the driver may fail * a _large_ part of the i/o request. Imagine the worst scenario: -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757828AbbJ2UOu (ORCPT ); Thu, 29 Oct 2015 16:14:50 -0400 Received: from mga01.intel.com ([192.55.52.88]:56848 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750753AbbJ2UMg (ORCPT ); Thu, 29 Oct 2015 16:12:36 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730653" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 08/11] fs: add get_block() to struct inode_operations Date: Thu, 29 Oct 2015 14:12:12 -0600 Message-Id: <1446149535-16200-9-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org To be able to flush dirty pages to media as part of the fsync/msync path DAX needs to be able to map file offsets to kernel addresses via a combination of the filesystem's get_block() routine and bdev_direct_access(). This currently happens in the DAX fault handlers which receive a get_block() callback directly from the filesystem via a function parameter. For the fsync/msync path this doesn't work, though, because DAX is called not by the filesystem but by the writeback infrastructure which doesn't know about the filesystem specific get_block() routine. To handle this we make get_block() an entry in the struct inode_operations table so that we can access the correct get_block() routine in the context of the writeback infrastructure. Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 1 + fs/ext4/file.c | 1 + fs/xfs/xfs_iops.c | 1 + include/linux/fs.h | 4 ++-- 4 files changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 11a42c5..fc1418c 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -202,4 +202,5 @@ const struct inode_operations ext2_file_inode_operations = { .get_acl = ext2_get_acl, .set_acl = ext2_set_acl, .fiemap = ext2_fiemap, + .get_block = ext2_get_block, }; diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 113837e..54d7729 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -720,5 +720,6 @@ const struct inode_operations ext4_file_inode_operations = { .get_acl = ext4_get_acl, .set_acl = ext4_set_acl, .fiemap = ext4_fiemap, + .get_block = ext4_get_block_dax, }; diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 8294132..c58c270 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1112,6 +1112,7 @@ static const struct inode_operations xfs_inode_operations = { .listxattr = xfs_vn_listxattr, .fiemap = xfs_vn_fiemap, .update_time = xfs_vn_update_time, + .get_block = xfs_get_blocks_direct, }; static const struct inode_operations xfs_dir_inode_operations = { diff --git a/include/linux/fs.h b/include/linux/fs.h index f791698..1dca85b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1679,8 +1679,8 @@ struct inode_operations { umode_t create_mode, int *opened); int (*tmpfile) (struct inode *, struct dentry *, umode_t); int (*set_acl)(struct inode *, struct posix_acl *, int); - - /* WARNING: probably going away soon, do not use! */ + int (*get_block)(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create); } ____cacheline_aligned; ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757648AbbJ2UMm (ORCPT ); Thu, 29 Oct 2015 16:12:42 -0400 Received: from mga01.intel.com ([192.55.52.88]:32223 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757573AbbJ2UMh (ORCPT ); Thu, 29 Oct 2015 16:12:37 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730663" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 09/11] dax: add support for fsync/sync Date: Thu, 29 Oct 2015 14:12:13 -0600 Message-Id: <1446149535-16200-10-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org To properly handle fsync/msync in an efficient way DAX needs to track dirty pages so it is able to flush them durably to media on demand. The tracking of dirty pages is done via the radix tree in struct address_space. This radix tree is already used by the page writeback infrastructure for tracking dirty pages associated with an open file, and it already has support for exceptional (non struct page*) entries. We build upon these features to add exceptional entries to the radix tree for DAX dirty PMD or PTE pages at fault time. When called as part of the msync/fsync flush path DAX queries the radix tree for dirty entries, flushing them and then marking the PTE or PMD page table entries as clean. The step of cleaning the PTE or PMD entries is necessary so that on subsequent writes to the same page we get a new write fault, allowing us to re-enter the DAX tag into the radix tree. Signed-off-by: Ross Zwisler --- fs/dax.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/dax.h | 1 + mm/huge_memory.c | 14 ++--- mm/page-writeback.c | 9 +++ 4 files changed, 172 insertions(+), 13 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 131fd35a..3b38aff 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -24,10 +24,13 @@ #include #include #include +#include #include +#include #include #include #include +#include /* * dax_clear_blocks() is called from within transaction context from XFS, @@ -287,6 +290,42 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, return 0; } +static int dax_dirty_pgoff(struct vm_area_struct *vma, + struct address_space *mapping, unsigned long pgoff, + bool pmd_entry) +{ + struct radix_tree_root *page_tree = &mapping->page_tree; + void *tag; + int error = 0; + + __mark_inode_dirty(file_inode(vma->vm_file), I_DIRTY_PAGES); + + spin_lock_irq(&mapping->tree_lock); + + tag = radix_tree_lookup(page_tree, pgoff); + if (tag) { + if (pmd_entry && tag == RADIX_TREE_DAX_PTE) { + radix_tree_delete(&mapping->page_tree, pgoff); + mapping->nrdax--; + } else + goto out; + } + + if (pmd_entry) + error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PMD); + else + error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PTE); + + if (error) + goto out; + + mapping->nrdax++; + radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY); + out: + spin_unlock_irq(&mapping->tree_lock); + return error; +} + static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -450,6 +489,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, delete_from_page_cache(page); unlock_page(page); page_cache_release(page); + page = NULL; } /* @@ -463,6 +503,13 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, * as for normal BH based IO completions. */ error = dax_insert_mapping(inode, &bh, vma, vmf); + if (error) + goto out; + + error = dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false); + if (error) + goto out; + if (buffer_unwritten(&bh)) { if (complete_unwritten) complete_unwritten(&bh, !error); @@ -537,7 +584,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, pgoff_t size, pgoff; sector_t block, sector; unsigned long pfn; - int result = 0; + int error, result = 0; /* Fall back to PTEs if we're going to COW */ if (write && !(vma->vm_flags & VM_SHARED)) @@ -638,6 +685,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, } result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); + + error = dax_dirty_pgoff(vma, inode->i_mapping, pgoff, true); + if (error) + goto fallback; } out: @@ -693,11 +744,11 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault); */ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) { - struct super_block *sb = file_inode(vma->vm_file)->i_sb; + struct file *file = vma->vm_file; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; - sb_start_pagefault(sb); - file_update_time(vma->vm_file); - sb_end_pagefault(sb); + dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false); return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); @@ -772,3 +823,103 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) return dax_zero_page_range(inode, from, length, get_block); } EXPORT_SYMBOL_GPL(dax_truncate_page); + +static int dax_flush_one_mapping(struct address_space *mapping, + struct inode *inode, sector_t block, void *tag) +{ + get_block_t *get_block = inode->i_op->get_block; + struct buffer_head bh; + void __pmem *addr; + int ret; + + BUG_ON(tag != RADIX_TREE_DAX_PMD && tag != RADIX_TREE_DAX_PTE); + + memset(&bh, 0, sizeof(bh)); + + if (tag == RADIX_TREE_DAX_PMD) + bh.b_size = PMD_SIZE; + else + bh.b_size = PAGE_SIZE; + + ret = get_block(inode, block, &bh, false); + BUG_ON(!buffer_written(&bh)); + if (ret < 0) + return ret; + + ret = dax_get_addr(&bh, &addr, inode->i_blkbits); + if (ret < 0) + return ret; + + if (tag == RADIX_TREE_DAX_PMD) + WARN_ON(ret != PMD_SIZE); + else + WARN_ON(ret != PAGE_SIZE); + + wb_cache_pmem(addr, ret); + + spin_lock_irq(&mapping->tree_lock); + radix_tree_delete(&mapping->page_tree, block); + spin_unlock_irq(&mapping->tree_lock); + mapping->nrdax--; + + return pgoff_mkclean(block, mapping); +} + +/* + * flush the mapping to the persistent domain within the byte range of (start, + * end). This is required by data integrity operations to ensure file data is on + * persistent storage prior to completion of the operation. It also requires us + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when + * the file is written to again so wehave an indication that we need to flush + * the mapping if a data integrity operation takes place. + * + * We don't need commits to storage here - the filesystems will issue flushes + * appropriately at the conclusion of the data integrity operation via REQ_FUA + * writes or blkdev_issue_flush() commands. This requires the DAX block device + * to implement persistent storage domain fencing/commits on receiving a + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher + * layers. + */ +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end) +{ + struct inode *inode = mapping->host; + pgoff_t indices[PAGEVEC_SIZE]; + struct pagevec pvec; + int i, error; + + pgoff_t start_page = start >> PAGE_CACHE_SHIFT; + pgoff_t end_page = end >> PAGE_CACHE_SHIFT; + + + if (mapping->nrdax == 0) + return 0; + + if (!inode->i_op->get_block) { + WARN_ONCE(1, "Flushing DAX mapping without get_block()!"); + mapping->nrdax = 0; + return 0; + } + + BUG_ON(inode->i_blkbits != PAGE_SHIFT); + + tag_pages_for_writeback(mapping, start_page, end_page); + + pagevec_init(&pvec, 0); + while (1) { + pvec.nr = find_get_entries_tag(mapping, start_page, + PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE, + pvec.pages, indices); + + if (pvec.nr == 0) + break; + + for (i = 0; i < pvec.nr; i++) { + error = dax_flush_one_mapping(mapping, inode, + indices[i], pvec.pages[i]); + if (error) + return error; + } + } + + return 0; +} diff --git a/include/linux/dax.h b/include/linux/dax.h index e9d57f68..5eff476 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -41,4 +41,5 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } +int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end); #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bbac913..1b3df56 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, spinlock_t *ptl; ptl = pmd_lock(mm, pmd); - if (pmd_none(*pmd)) { - entry = pmd_mkhuge(pfn_pmd(pfn, prot)); - if (write) { - entry = pmd_mkyoung(pmd_mkdirty(entry)); - entry = maybe_pmd_mkwrite(entry, vma); - } - set_pmd_at(mm, addr, pmd, entry); - update_mmu_cache_pmd(vma, addr, pmd); + entry = pmd_mkhuge(pfn_pmd(pfn, prot)); + if (write) { + entry = pmd_mkyoung(pmd_mkdirty(entry)); + entry = maybe_pmd_mkwrite(entry, vma); } + set_pmd_at(mm, addr, pmd, entry); + update_mmu_cache_pmd(vma, addr, pmd); spin_unlock(ptl); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2c90357..1801df8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include "internal.h" @@ -2338,6 +2339,14 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) if (wbc->nr_to_write <= 0) return 0; + + if (wbc->sync_mode == WB_SYNC_ALL && dax_mapping(mapping)) { + ret = dax_flush_mapping(mapping, wbc->range_start, + wbc->range_end); + if (ret) + return ret; + } + if (mapping->a_ops->writepages) ret = mapping->a_ops->writepages(mapping, wbc); else -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757799AbbJ2UND (ORCPT ); Thu, 29 Oct 2015 16:13:03 -0400 Received: from mga01.intel.com ([192.55.52.88]:22675 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755607AbbJ2UMj (ORCPT ); Thu, 29 Oct 2015 16:12:39 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730673" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault Date: Thu, 29 Oct 2015 14:12:14 -0600 Message-Id: <1446149535-16200-11-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Previously the functionality offered by dax_pfn_mkwrite() was open-coded in XFS and ext2. With the addition of DAX msync/fsync support we need to call dax_pfn_mkwrite() so that the newly writeable page can be added to the radix tree and marked as dirty. Signed-off-by: Ross Zwisler --- fs/ext2/file.c | 4 +++- fs/xfs/xfs_file.c | 9 +++++---- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index fc1418c..5741e29 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, { struct inode *inode = file_inode(vma->vm_file); struct ext2_inode_info *ei = EXT2_I(inode); - int ret = VM_FAULT_NOPAGE; loff_t size; + int ret; sb_start_pagefault(inode->i_sb); file_update_time(vma->vm_file); @@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma, size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); up_read(&ei->dax_sem); sb_end_pagefault(inode->i_sb); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index f429662..584e4eb 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1575,9 +1575,8 @@ xfs_filemap_pmd_fault( /* * pfn_mkwrite was originally inteneded to ensure we capture time stamp * updates on write faults. In reality, it's need to serialise against - * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite() - * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault - * barrier in place. + * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED + * to ensure we serialise the fault barrier in place. */ static int xfs_filemap_pfn_mkwrite( @@ -1587,7 +1586,7 @@ xfs_filemap_pfn_mkwrite( struct inode *inode = file_inode(vma->vm_file); struct xfs_inode *ip = XFS_I(inode); - int ret = VM_FAULT_NOPAGE; + int ret; loff_t size; trace_xfs_filemap_pfn_mkwrite(ip); @@ -1600,6 +1599,8 @@ xfs_filemap_pfn_mkwrite( size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); sb_end_pagefault(inode->i_sb); return ret; -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757739AbbJ2UNA (ORCPT ); Thu, 29 Oct 2015 16:13:00 -0400 Received: from mga01.intel.com ([192.55.52.88]:32223 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756914AbbJ2UMj (ORCPT ); Thu, 29 Oct 2015 16:12:39 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,215,1444719600"; d="scan'208";a="838730681" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: [RFC 11/11] ext4: add ext4_dax_pfn_mkwrite() Date: Thu, 29 Oct 2015 14:12:15 -0600 Message-Id: <1446149535-16200-12-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add ext4_dax_pfn_mkwrite() that properly protects against freeze, sets the file update time and calls into dax_pfn_mkwrite() to add the newly dirtied page to the radix tree which tracks dirty DAX pages. Signed-off-by: Ross Zwisler --- fs/ext4/file.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 54d7729..7f62cad 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -272,11 +272,31 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) ext4_end_io_unwritten); } +static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct inode *inode = file_inode(vma->vm_file); + loff_t size; + int ret; + + sb_start_pagefault(inode->i_sb); + file_update_time(vma->vm_file); + + /* check that the faulting page hasn't raced with truncate */ + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; + if (vmf->pgoff >= size) + ret = VM_FAULT_SIGBUS; + else + ret = dax_pfn_mkwrite(vma, vmf); + + sb_end_pagefault(inode->i_sb); + return ret; +} + static const struct vm_operations_struct ext4_dax_vm_ops = { .fault = ext4_dax_fault, .pmd_fault = ext4_dax_pmd_fault, .page_mkwrite = ext4_dax_mkwrite, - .pfn_mkwrite = dax_pfn_mkwrite, + .pfn_mkwrite = ext4_dax_pfn_mkwrite, }; #else #define ext4_dax_vm_ops ext4_file_vm_ops -- 2.1.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758251AbbJ2W42 (ORCPT ); Thu, 29 Oct 2015 18:56:28 -0400 Received: from mga03.intel.com ([134.134.136.65]:22487 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757671AbbJ2WuN (ORCPT ); Thu, 29 Oct 2015 18:50:13 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,216,1444719600"; d="scan'208";a="838826757" Date: Thu, 29 Oct 2015 16:49:53 -0600 From: Ross Zwisler To: Ross Zwisler , Dave Chinner Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151029224953.GA17933@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dave Chinner , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html Hmm...I think I may need to isolate the fsync/msync flushing against races with truncate since we are calling into the filesystem directly with get_block(). Dave (Chinner), does this sound right? Also, one thing I forgot to mention is that these patches are built upon the first version of Dave Chinner's XFS patches and my ext2 patches that deal with the truncate races with DAX. A snapshot of my development tree with these patches applied can be found here: https://github.com/01org/prd/tree/fsync_rfc From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965565AbbJ3Dzi (ORCPT ); Thu, 29 Oct 2015 23:55:38 -0400 Received: from ipmail05.adl6.internode.on.net ([150.101.137.143]:26156 "EHLO ipmail05.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753231AbbJ3Dzh (ORCPT ); Thu, 29 Oct 2015 23:55:37 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2AOBwAA6TJWXQLd03ZegztTb4Jdp1IBAQQGiy2LMBcMhXAEAgKBO00BAQEBAQEHRECENQEBAQMBAQIkExwjBQsIAxgJJQ8FFBEDBxoTG4gNBw7EWAEBAQEBBQIBIBmGF4VFgnGBVoR5BZZDhR2FHoJigimaFYR7KjQBhDSBSQEBAQ Date: Fri, 30 Oct 2015 14:55:33 +1100 From: Dave Chinner To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151030035533.GU19199@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. We need the "big hammer" regardless of fsync. If REQ_FLUSH and REQ_FUA don't do the right thing when it comes to ordering journal writes against other IO operations, then the filesystems are not crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding changes back to stable storage, just like they do for existing storage.... > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > Ross Zwisler (11): > pmem: add wb_cache_pmem() to the PMEM API > mm: add pmd_mkclean() > pmem: enable REQ_FLUSH handling > dax: support dirty DAX entries in radix tree > mm: add follow_pte_pmd() > mm: add pgoff_mkclean() > mm: add find_get_entries_tag() > fs: add get_block() to struct inode_operations I don't think this is the right thing to do - it propagates the use of bufferheads as a mapping structure into places where we do not want bufferheads. We've recently added a similar block mapping interface to the export operations structure for PNFS and that uses a "struct iomap" which is far more suited to being an inode operation this. We have plans to move this to the inode operations for various reasons. e.g: multipage write, adding interfaces that support proper mapping of holes, etc: https://www.redhat.com/archives/cluster-devel/2014-October/msg00167.html So after many years of saying no to moving getblocks to the inode operations it seems like the wrong thing to do now considering I want to convert all the DAX code to use iomaps while only 2/3 filesystems are supported... > dax: add support for fsync/sync Why put the dax_flush_mapping() in do_writepages()? Why not call it directly from the filesystem ->fsync() implementations where a getblocks callback could also be provided? Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760562AbbJ3SeL (ORCPT ); Fri, 30 Oct 2015 14:34:11 -0400 Received: from mail-wi0-f180.google.com ([209.85.212.180]:33640 "EHLO mail-wi0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752472AbbJ3SeI (ORCPT ); Fri, 30 Oct 2015 14:34:08 -0400 MIME-Version: 1.0 In-Reply-To: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> Date: Fri, 30 Oct 2015 11:34:07 -0700 Message-ID: Subject: Re: [RFC 00/11] DAX fsynx/msync support From: Dan Williams To: Ross Zwisler Cc: "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler wrote: > This patch series adds support for fsync/msync to DAX. > > Patches 1 through 8 add various utilities that the DAX code will eventually > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > filesystem changes that are needed after the DAX code is added, but these > patches may change slightly as the filesystem fault handling for DAX is > being modified ([1] and [2]). > > I've marked this series as RFC because I'm still testing, but I wanted to > get this out there so people would see the direction I was going and > hopefully comment on any big red flags sooner rather than later. > > I realize that we are getting pretty dang close to the v4.4 merge window, > but I think that if we can get this reviewed and working it's a much better > solution than the "big hammer" approach that blindly flushes entire PMEM > namespaces [3]. > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > Ross Zwisler (11): > pmem: add wb_cache_pmem() to the PMEM API > mm: add pmd_mkclean() > pmem: enable REQ_FLUSH handling > dax: support dirty DAX entries in radix tree > mm: add follow_pte_pmd() > mm: add pgoff_mkclean() > mm: add find_get_entries_tag() > fs: add get_block() to struct inode_operations > dax: add support for fsync/sync > xfs, ext2: call dax_pfn_mkwrite() on write fault > ext4: add ext4_dax_pfn_mkwrite() This is great to have when the flush-the-world solution ends up killing performance. However, there are a couple mitigating options for workloads that dirty small amounts and flush often that we need to collect data on: 1/ Using cache management and pcommit from userspace to skip calls to msync / fsync. Although, this does not eliminate all calls to blkdev_issue_flush as the fs may invoke it for other reasons. I suspect turning on REQ_FUA support eliminates a number of those invocations, and pmem already satisfies REQ_FUA semantics by default. 2/ Turn off DAX and use the page cache. As Dave mentions [1] we should enable this control on a per-inode basis. I'm folding in this capability as a blkdev_ioctl for the next version of the raw block DAX support patch. It's entirely possible these mitigations won't eliminate the need for a mechanism like this, but I think we have a bit more work to do to find out how bad this is in practice as well as the crossover point where walking the radix becomes prohibitive. We also have the option of tracking open DAX extents in the driver. Even at coarse granularities I'd be surprised if we can't mitigate most of the overhead. [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002598.html From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760618AbbJ3Sjp (ORCPT ); Fri, 30 Oct 2015 14:39:45 -0400 Received: from mga09.intel.com ([134.134.136.24]:4836 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751509AbbJ3Sjl (ORCPT ); Fri, 30 Oct 2015 14:39:41 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,219,1444719600"; d="scan'208";a="591187898" Date: Fri, 30 Oct 2015 12:39:38 -0600 From: Ross Zwisler To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151030183938.GC24643@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dave Chinner , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151030035533.GU19199@dastard> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote: > On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > > This patch series adds support for fsync/msync to DAX. > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > filesystem changes that are needed after the DAX code is added, but these > > patches may change slightly as the filesystem fault handling for DAX is > > being modified ([1] and [2]). > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > get this out there so people would see the direction I was going and > > hopefully comment on any big red flags sooner rather than later. > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > but I think that if we can get this reviewed and working it's a much better > > solution than the "big hammer" approach that blindly flushes entire PMEM > > namespaces [3]. > > We need the "big hammer" regardless of fsync. If REQ_FLUSH and > REQ_FUA don't do the right thing when it comes to ordering journal > writes against other IO operations, then the filesystems are not > crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding > changes back to stable storage, just like they do for existing > storage.... I think that what I've got here (when it's fully working) will protect all the cases that we need. AFAIK there are three ways that data can be written to a PMEM namespace: 1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or pmem_rw_bytes(). All of these paths sync the newly written data durably to media before the I/O completes so they shouldn't have any reliance on REQ_FUA/REQ_FLUSH. 2) Through the DAX I/O path, dax_io(). As with PMEM we flush the newly written data durably to media before the I/O operation completes, so this path shouldn't have any reliance on REQ_FUA/REQ_FLUSH. 3) Through mmaps set up by DAX. This is the path we are trying to protect with the dirty page tracking and flushing in this patch set, and I think that this is the only path that has reliance on REQ_FLUSH. The goal of this set is to have the cache writeback all happen as part of the fsync/msync handling, and then have the REQ_FLUSH just provide the trailing wmb_pmem(). My guess is that XFS metadata writes happen via path 1), down through the PMEM driver. Am I missing anything, or should we be good to go? > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > > > Ross Zwisler (11): > > pmem: add wb_cache_pmem() to the PMEM API > > mm: add pmd_mkclean() > > pmem: enable REQ_FLUSH handling > > dax: support dirty DAX entries in radix tree > > mm: add follow_pte_pmd() > > mm: add pgoff_mkclean() > > mm: add find_get_entries_tag() > > fs: add get_block() to struct inode_operations > > I don't think this is the right thing to do - it propagates the use > of bufferheads as a mapping structure into places where we do not > want bufferheads. We've recently added a similar block mapping > interface to the export operations structure for PNFS and that uses > a "struct iomap" which is far more suited to being an inode > operation this. > > We have plans to move this to the inode operations for various > reasons. e.g: multipage write, adding interfaces that support proper > mapping of holes, etc: > > https://www.redhat.com/archives/cluster-devel/2014-October/msg00167.html > > So after many years of saying no to moving getblocks to the inode > operations it seems like the wrong thing to do now considering I > want to convert all the DAX code to use iomaps while only 2/3 > filesystems are supported... Okay, I'll take a look at this interface. I also think that we may need to flow through the filesystem before going into the DAX code so that we can serialize our flushing with respect to extent manipulation, as we had to do with our DAX fault paths. > > dax: add support for fsync/sync > > Why put the dax_flush_mapping() in do_writepages()? Why not call it > directly from the filesystem ->fsync() implementations where a > getblocks callback could also be provided? Because that's where you put it in your example. :) https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html Moving it into the filesystem where we know about get_block() is probably the right thing to do - I'll check it out. Thanks! From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030543AbbJ3TnG (ORCPT ); Fri, 30 Oct 2015 15:43:06 -0400 Received: from mga01.intel.com ([192.55.52.88]:33126 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030445AbbJ3TnD (ORCPT ); Fri, 30 Oct 2015 15:43:03 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,220,1444719600"; d="scan'208";a="675275057" Date: Fri, 30 Oct 2015 13:43:00 -0600 From: Ross Zwisler To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151030194300.GA22670@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: > On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler > wrote: > > This patch series adds support for fsync/msync to DAX. > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > filesystem changes that are needed after the DAX code is added, but these > > patches may change slightly as the filesystem fault handling for DAX is > > being modified ([1] and [2]). > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > get this out there so people would see the direction I was going and > > hopefully comment on any big red flags sooner rather than later. > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > but I think that if we can get this reviewed and working it's a much better > > solution than the "big hammer" approach that blindly flushes entire PMEM > > namespaces [3]. > > > > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html > > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 > > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html > > > > Ross Zwisler (11): > > pmem: add wb_cache_pmem() to the PMEM API > > mm: add pmd_mkclean() > > pmem: enable REQ_FLUSH handling > > dax: support dirty DAX entries in radix tree > > mm: add follow_pte_pmd() > > mm: add pgoff_mkclean() > > mm: add find_get_entries_tag() > > fs: add get_block() to struct inode_operations > > dax: add support for fsync/sync > > xfs, ext2: call dax_pfn_mkwrite() on write fault > > ext4: add ext4_dax_pfn_mkwrite() > > This is great to have when the flush-the-world solution ends up > killing performance. However, there are a couple mitigating options > for workloads that dirty small amounts and flush often that we need to > collect data on: > > 1/ Using cache management and pcommit from userspace to skip calls to > msync / fsync. Although, this does not eliminate all calls to > blkdev_issue_flush as the fs may invoke it for other reasons. I > suspect turning on REQ_FUA support eliminates a number of those > invocations, and pmem already satisfies REQ_FUA semantics by default. Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM already handles the requirements of REQ_FUA, but I didn't realize that it might reduce the number of REQ_FLUSH bios we receive. > 2/ Turn off DAX and use the page cache. As Dave mentions [1] we > should enable this control on a per-inode basis. I'm folding in this > capability as a blkdev_ioctl for the next version of the raw block DAX > support patch. Umm...I think you just said "the way to avoid this delay is to just not use DAX". :) I don't think this is where we want to go - we are trying to make DAX better, not abandon it. > It's entirely possible these mitigations won't eliminate the need for > a mechanism like this, but I think we have a bit more work to do to > find out how bad this is in practice as well as the crossover point > where walking the radix becomes prohibitive. I'm guessing a single run through xfstests will be enough to convince you that the "big hammer" approach is untenable. Tests that used to take a second now take several minutes, at least in my VM testing environment... And that's only using a tiny 4GiB namespace. Yes, we can distribute the cost over multiple CPUs, but that just distributes the problem and doesn't reduce the overall work that needs to be done. Ultimately I think that looping through multiple GiB or even TiB of cache lines and blindly writing them back individually on every REQ_FLUSH is going to be a deal breaker. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760564AbbJ3Tvo (ORCPT ); Fri, 30 Oct 2015 15:51:44 -0400 Received: from mail-wi0-f177.google.com ([209.85.212.177]:33740 "EHLO mail-wi0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752504AbbJ3Tvl (ORCPT ); Fri, 30 Oct 2015 15:51:41 -0400 MIME-Version: 1.0 In-Reply-To: <20151030194300.GA22670@linux.intel.com> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030194300.GA22670@linux.intel.com> Date: Fri, 30 Oct 2015 12:51:40 -0700 Message-ID: Subject: Re: [RFC 00/11] DAX fsynx/msync support From: Dan Williams To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler wrote: > On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: >> On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler >> wrote: >> > This patch series adds support for fsync/msync to DAX. >> > >> > Patches 1 through 8 add various utilities that the DAX code will eventually >> > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are >> > filesystem changes that are needed after the DAX code is added, but these >> > patches may change slightly as the filesystem fault handling for DAX is >> > being modified ([1] and [2]). >> > >> > I've marked this series as RFC because I'm still testing, but I wanted to >> > get this out there so people would see the direction I was going and >> > hopefully comment on any big red flags sooner rather than later. >> > >> > I realize that we are getting pretty dang close to the v4.4 merge window, >> > but I think that if we can get this reviewed and working it's a much better >> > solution than the "big hammer" approach that blindly flushes entire PMEM >> > namespaces [3]. >> > >> > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html >> > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2 >> > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html >> > >> > Ross Zwisler (11): >> > pmem: add wb_cache_pmem() to the PMEM API >> > mm: add pmd_mkclean() >> > pmem: enable REQ_FLUSH handling >> > dax: support dirty DAX entries in radix tree >> > mm: add follow_pte_pmd() >> > mm: add pgoff_mkclean() >> > mm: add find_get_entries_tag() >> > fs: add get_block() to struct inode_operations >> > dax: add support for fsync/sync >> > xfs, ext2: call dax_pfn_mkwrite() on write fault >> > ext4: add ext4_dax_pfn_mkwrite() >> >> This is great to have when the flush-the-world solution ends up >> killing performance. However, there are a couple mitigating options >> for workloads that dirty small amounts and flush often that we need to >> collect data on: >> >> 1/ Using cache management and pcommit from userspace to skip calls to >> msync / fsync. Although, this does not eliminate all calls to >> blkdev_issue_flush as the fs may invoke it for other reasons. I >> suspect turning on REQ_FUA support eliminates a number of those >> invocations, and pmem already satisfies REQ_FUA semantics by default. > > Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM > already handles the requirements of REQ_FUA, but I didn't realize that it > might reduce the number of REQ_FLUSH bios we receive. I'll let Dave chime in, but a lot of the flush requirements come from guaranteeing the state of the metadata, if metadata updates can be done with REQ_FUA then there is no subsequent need to flush. >> 2/ Turn off DAX and use the page cache. As Dave mentions [1] we >> should enable this control on a per-inode basis. I'm folding in this >> capability as a blkdev_ioctl for the next version of the raw block DAX >> support patch. > > Umm...I think you just said "the way to avoid this delay is to just not use > DAX". :) I don't think this is where we want to go - we are trying to make > DAX better, not abandon it. That's a bit of an exaggeration. Avoiding DAX where it is not necessary is not "abandoning DAX", it's using the right tool for the job. Page cache is fine for many cases. >> It's entirely possible these mitigations won't eliminate the need for >> a mechanism like this, but I think we have a bit more work to do to >> find out how bad this is in practice as well as the crossover point >> where walking the radix becomes prohibitive. > > I'm guessing a single run through xfstests will be enough to convince you that > the "big hammer" approach is untenable. Tests that used to take a second now > take several minutes, at least in my VM testing environment... And that's > only using a tiny 4GiB namespace. > > Yes, we can distribute the cost over multiple CPUs, but that just distributes > the problem and doesn't reduce the overall work that needs to be done. > Ultimately I think that looping through multiple GiB or even TiB of cache > lines and blindly writing them back individually on every REQ_FLUSH is going > to be a deal breaker. Right, part of the problem is that the driver doesn't know which blocks are actively DAX mapped. I think we can incrementally fix that without requiring DAX specific fsync/msync handling code for each fs that supports DAX. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753379AbbKAXaI (ORCPT ); Sun, 1 Nov 2015 18:30:08 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:51446 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752220AbbKAXaF (ORCPT ); Sun, 1 Nov 2015 18:30:05 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2CCBgCTnzZW/8m/LHleDoMtgUKqMwEBAwEGiy6LL4YTBAICgSJNAQEBAQEBgQuENQEBAQMBOhwoCwgDGAklDwUlAyEBEhuIDQfBUgEBCAIBIBmGF4VFhEeDZYEUBZZDiA2FEIFhlm+DcmODRlIqNIQ1gUkBAQE Date: Mon, 2 Nov 2015 10:29:48 +1100 From: Dave Chinner To: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151101232948.GF10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151030183938.GC24643@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 30, 2015 at 12:39:38PM -0600, Ross Zwisler wrote: > On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote: > > On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote: > > > This patch series adds support for fsync/msync to DAX. > > > > > > Patches 1 through 8 add various utilities that the DAX code will eventually > > > need, and the DAX code itself is added by patch 9. Patches 10 and 11 are > > > filesystem changes that are needed after the DAX code is added, but these > > > patches may change slightly as the filesystem fault handling for DAX is > > > being modified ([1] and [2]). > > > > > > I've marked this series as RFC because I'm still testing, but I wanted to > > > get this out there so people would see the direction I was going and > > > hopefully comment on any big red flags sooner rather than later. > > > > > > I realize that we are getting pretty dang close to the v4.4 merge window, > > > but I think that if we can get this reviewed and working it's a much better > > > solution than the "big hammer" approach that blindly flushes entire PMEM > > > namespaces [3]. > > > > We need the "big hammer" regardless of fsync. If REQ_FLUSH and > > REQ_FUA don't do the right thing when it comes to ordering journal > > writes against other IO operations, then the filesystems are not > > crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding > > changes back to stable storage, just like they do for existing > > storage.... > > I think that what I've got here (when it's fully working) will protect all the > cases that we need. > > AFAIK there are three ways that data can be written to a PMEM namespace: > > 1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or > pmem_rw_bytes(). All of these paths sync the newly written data durably to > media before the I/O completes so they shouldn't have any reliance on > REQ_FUA/REQ_FLUSH. I suspect that not all future pmem devices will use this driver/interface/semantics. Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable storage" commands. They are also IO barriers that affect scheduling of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA IO cannot be dispatched before all prior IO has been dispatched and drained from the request queue, and IO submitted after a queued REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued REQ_FLUSH/REQ_FUA operation. IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable storage, they also guarantee the order of IO dispatch and completion when concurrent IO is in progress. > 2) Through the DAX I/O path, dax_io(). As with PMEM we flush the newly > written data durably to media before the I/O operation completes, so this path > shouldn't have any reliance on REQ_FUA/REQ_FLUSH. That's fine, but that's not the problem we need solved ;) > 3) Through mmaps set up by DAX. This is the path we are trying to protect > with the dirty page tracking and flushing in this patch set, and I think that > this is the only path that has reliance on REQ_FLUSH. Quite possibly this is the case for the current intel pmem driver, but I don't look at the functionality from that perspective. Dirty page tracking is needed to enable "data writeback", whether it be CPU cachelines via pcommit() or dirty pages via submit_bio(). How the pages get dirty is irrelevant - the fact is they are dirty and we need to do /something/ to ensure they are correctly written back to the storage layer. REQ_FLUSH is needed to guarantee all data that has been written back to the storage layer is persistent in that layer. How a /driver/ manages that is up to the driver - the actual implementation is irrelevant to the higher layers. i.e. what we are concerned about at the filesystem level is that: a) "data writeback" is started correctly; b) the "data writeback" is completed; and c) volatile caches are completely flushed before we write the metadata changes that reference that data to the journal via FUA e.g. we could have pmem, but we are using buffered IO (i.e. non-DAX) and a hardware driver that doesn't flush CPU cachelines in the physical IO path. This requires that driver to flush CPU cachelines and place memory barriers in REQ_FLUSH operations, as well as after writing the data in REQ_FUA operations. Yes, this is different to the way the intel pmem drivers work (i.e. as noted in 1) above), but it is /not wrong/ as long as REQ_FLUSH/REQ_FUA also flush dirty cpu cachelines. IOWs, the high level code we write that implements fsync for DAX needs to be generic enough so that when something slightly different comes along we don't have to throw everything away and start again. I think your code will end up being generic enough to handle this, but let's make sure we don't implement something that can only work with pmem hardware/drivers that do all IO as fully synchronous to the stable domain... Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753334AbbKAXgv (ORCPT ); Sun, 1 Nov 2015 18:36:51 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:20258 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752741AbbKAXgs (ORCPT ); Sun, 1 Nov 2015 18:36:48 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2DGBgCHoDZW/8m/LHlegzuBQqozAQEDAQaLLosvhhMCAgEBAoEiTQEBAQEBAYELhDUBAQEDATocIxAIAxgJJQ8FJQMhExuIDQfBSAEBCAIBIBmGF4Q/gQaJQAWWQ4gNhRCcQmOCER2Baio0AYV9AQEB Date: Mon, 2 Nov 2015 10:36:32 +1100 From: Dave Chinner To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151101233632.GG10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030194300.GA22670@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 30, 2015 at 12:51:40PM -0700, Dan Williams wrote: > On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler > wrote: > > On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote: > >> This is great to have when the flush-the-world solution ends up > >> killing performance. However, there are a couple mitigating options > >> for workloads that dirty small amounts and flush often that we need to > >> collect data on: > >> > >> 1/ Using cache management and pcommit from userspace to skip calls to > >> msync / fsync. Although, this does not eliminate all calls to > >> blkdev_issue_flush as the fs may invoke it for other reasons. I > >> suspect turning on REQ_FUA support eliminates a number of those > >> invocations, and pmem already satisfies REQ_FUA semantics by default. > > > > Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM > > already handles the requirements of REQ_FUA, but I didn't realize that it > > might reduce the number of REQ_FLUSH bios we receive. > > I'll let Dave chime in, but a lot of the flush requirements come from > guaranteeing the state of the metadata, if metadata updates can be > done with REQ_FUA then there is no subsequent need to flush. No need for cache flushes in this case, but we still need the IO scheduler to order such operations correctly. > >> 2/ Turn off DAX and use the page cache. As Dave mentions [1] we > >> should enable this control on a per-inode basis. I'm folding in this > >> capability as a blkdev_ioctl for the next version of the raw block DAX > >> support patch. > > > > Umm...I think you just said "the way to avoid this delay is to just not use > > DAX". :) I don't think this is where we want to go - we are trying to make > > DAX better, not abandon it. > > That's a bit of an exaggeration. Avoiding DAX where it is not > necessary is not "abandoning DAX", it's using the right tool for the > job. Page cache is fine for many cases. Think btrfs - any file that uses COW can't use DAX for write. Everything has to be buffered, unless the nodatacow flag is set, and then DAX can be used. Indeed, on ext4 if you are using file encryption you can't use DAX. IOWs, we already know that we have to support mixed DAX/non-DAX access within the same filesystem, so I'm with Dan here... Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932079AbbKBOWZ (ORCPT ); Mon, 2 Nov 2015 09:22:25 -0500 Received: from mx1.redhat.com ([209.132.183.28]:39691 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753803AbbKBOWW (ORCPT ); Mon, 2 Nov 2015 09:22:22 -0500 From: Jeff Moyer To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> X-PGP-KeyID: 1F78E1B4 X-PGP-CertKey: F6FE 280D 8293 F72C 65FD 5A58 1FF8 A7CA 1F78 E1B4 X-PCLoadLetter: What the f**k does that mean? Date: Mon, 02 Nov 2015 09:22:15 -0500 In-Reply-To: <20151101232948.GF10656@dastard> (Dave Chinner's message of "Mon, 2 Nov 2015 10:29:48 +1100") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Dave Chinner writes: > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > storage" commands. They are also IO barriers that affect scheduling > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > IO cannot be dispatched before all prior IO has been dispatched and > drained from the request queue, and IO submitted after a queued > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > REQ_FLUSH/REQ_FUA operation. > > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > storage, they also guarantee the order of IO dispatch and > completion when concurrent IO is in progress. This hasn't been the case for several years, now. It used to work that way, and that was deemed a big performance problem. Since file systems already issued and waited for all I/O before sending down a barrier, we decided to get rid of the I/O ordering pieces of barriers (and stop calling them barriers). See commit 28e7d184521 (block: drop barrier ordering by queue draining). Cheers, Jeff From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752699AbbKBUKt (ORCPT ); Mon, 2 Nov 2015 15:10:49 -0500 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:28271 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752652AbbKBUKq (ORCPT ); Mon, 2 Nov 2015 15:10:46 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2CdBwBvwjdW/+rW03ZegzuBQqpFAQEBAQEBBosuiy+GEwICAQECgTtNAQEBAQEBgQuENgEBBCcTHCMQCAMOCgklDwUlAyETiC/BbgEBCAIBIBmGF4VFhEeEeQWWQ4gNhRCcQmOCER2Baio0hDWBSQEBAQ Date: Tue, 3 Nov 2015 07:10:29 +1100 From: Dave Chinner To: Jeff Moyer Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151102201029.GI10656@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: > Dave Chinner writes: > > > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > > storage" commands. They are also IO barriers that affect scheduling > > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > > IO cannot be dispatched before all prior IO has been dispatched and > > drained from the request queue, and IO submitted after a queued > > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > > REQ_FLUSH/REQ_FUA operation. > > > > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > > storage, they also guarantee the order of IO dispatch and > > completion when concurrent IO is in progress. > > This hasn't been the case for several years, now. It used to work that > way, and that was deemed a big performance problem. Since file systems > already issued and waited for all I/O before sending down a barrier, we > decided to get rid of the I/O ordering pieces of barriers (and stop > calling them barriers). > > See commit 28e7d184521 (block: drop barrier ordering by queue draining). Yes, I realise that, even if I wasn't very clear about how I wrote it. ;) Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO scheduler calls to elv_dispatch_sort() that don't pass REQ_SOFTBARRIER in the queue. IOWs, if we queue a bunch of REQ_WRITE IOs followed by a REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured by the cache flush. Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO to complete, we know that all the earlier REQ_WRITE IOs are on stable storage, too. Hence there's no need for the elevator to drain the queue to guarantee completion ordering - the dispatch ordering and flush/fua write semantics guarantee that when the flush/fua completes, all the IOs dispatch prior to that flush/fua write are also on stable storage... Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1032348AbbKDSf4 (ORCPT ); Wed, 4 Nov 2015 13:35:56 -0500 Received: from mx1.redhat.com ([209.132.183.28]:63095 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031775AbbKDSfD (ORCPT ); Wed, 4 Nov 2015 13:35:03 -0500 From: Jeff Moyer To: Dave Chinner Cc: linux-nvdimm@ml01.01.org, "J. Bruce Fields" , linux-mm@kvack.org, Andreas Dilger , "H. Peter Anvin" , Jeff Layton , x86@kernel.org, Ingo Molnar , linux-ext4@vger.kernel.org, xfs@oss.sgi.com, Alexander Viro , Thomas Gleixner , axboe@kernel.dk, "Theodore Ts'o" , linux-kernel@vger.kernel.org, Jan Kara , linux-fsdevel@vger.kernel.org, Andrew Morton , Matthew Wilcox Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> X-PGP-KeyID: 1F78E1B4 X-PGP-CertKey: F6FE 280D 8293 F72C 65FD 5A58 1FF8 A7CA 1F78 E1B4 X-PCLoadLetter: What the f**k does that mean? Date: Wed, 04 Nov 2015 13:34:58 -0500 In-Reply-To: (Jeff Moyer's message of "Mon, 02 Nov 2015 16:02:48 -0500") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Jeff Moyer writes: >> Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO >> to complete, we know that all the earlier REQ_WRITE IOs are on >> stable storage, too. Hence there's no need for the elevator to drain >> the queue to guarantee completion ordering - the dispatch ordering >> and flush/fua write semantics guarantee that when the flush/fua >> completes, all the IOs dispatch prior to that flush/fua write are >> also on stable storage... > > Des xfs rely on this model for correctness? If so, I'd say we've got a > problem. Dave? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1032729AbbKEIdT (ORCPT ); Thu, 5 Nov 2015 03:33:19 -0500 Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:25491 "EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030983AbbKEIdO (ORCPT ); Thu, 5 Nov 2015 03:33:14 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2AhCgDYEztWPHy2LHleKAECgxCBQoZcpBsBAQEBAQEGiy+JLYYMAgIBAQKBM00BAQEBAQEHAQEBAUE/hDUBAQEDAScTHCMFCwgDDgoJJQ8FJQMHGhOIJgfCAgEBCAIBIBmFdIVFhESEdQWWSIgNhQ6cSoJ0HYFqKjSDVoFJAQEB Date: Thu, 5 Nov 2015 19:33:09 +1100 From: Dave Chinner To: Jeff Moyer Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox , axboe@kernel.dk Subject: Re: [RFC 00/11] DAX fsynx/msync support Message-ID: <20151105083309.GJ19199@dastard> References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [ sorry for slow response, been without an internet connection for ~36 hours ] On Mon, Nov 02, 2015 at 04:02:48PM -0500, Jeff Moyer wrote: > Dave Chinner writes: > > > On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote: > >> Dave Chinner writes: > >> > >> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable > >> > storage" commands. They are also IO barriers that affect scheduling > >> > of IOs in progress and in the request queues. A REQ_FLUSH/REQ_FUA > >> > IO cannot be dispatched before all prior IO has been dispatched and > >> > drained from the request queue, and IO submitted after a queued > >> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued > >> > REQ_FLUSH/REQ_FUA operation. > >> > > >> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable > >> > storage, they also guarantee the order of IO dispatch and > >> > completion when concurrent IO is in progress. > >> > >> This hasn't been the case for several years, now. It used to work that > >> way, and that was deemed a big performance problem. Since file systems > >> already issued and waited for all I/O before sending down a barrier, we > >> decided to get rid of the I/O ordering pieces of barriers (and stop > >> calling them barriers). > >> > >> See commit 28e7d184521 (block: drop barrier ordering by queue draining). > > > > Yes, I realise that, even if I wasn't very clear about how I wrote > > it. ;) > > > > Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO > > barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH > > -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO > > scheduler calls to elv_dispatch_sort() that don't pass > > REQ_SOFTBARRIER in the queue. > > This part is right. > > > IOWs, if we queue a bunch of REQ_WRITE IOs followed by a > > REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be > > dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured > > by the cache flush. > > But this part is not. It is up to the I/O scheduler to decide when to > dispatch requests. It can hold on to them for a variety of reasons. > Flush requests, however, do not go through the I/O scheduler. At the That's pure REQ_FLUSH bios, right? Aren't data IOs with REQ_FLUSH|REQ_FUA sorted like any other IO? > very moment that the flush request is inserted, it goes directly to the > dispatch queue (assuming no other flush is in progress). The prior > requests may still be waiting in the I/O scheduler's internal lists. > > So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH. > However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the > queue. Uh, ok, that's different, and most definitely not the "IO barrier" I was under the impression REQ_FLUSH|REQ_FUA gave us. > > Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO > > to complete, we know that all the earlier REQ_WRITE IOs are on > > stable storage, too. Hence there's no need for the elevator to drain > > the queue to guarantee completion ordering - the dispatch ordering > > and flush/fua write semantics guarantee that when the flush/fua > > completes, all the IOs dispatch prior to that flush/fua write are > > also on stable storage... > > Des xfs rely on this model for correctness? If so, I'd say we've got a > problem No, it doesn't. The XFS integrity model doesn't trust the IO layers to tell the truth about IO ordering and completion or for it's developers to fully understand how IO layer ordering works. :P i.e. we wait for full completions of all dependent IO before issuing flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure the dependent IOs are fully caught by the cache flushes... Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754828AbbKETta (ORCPT ); Thu, 5 Nov 2015 14:49:30 -0500 Received: from mx1.redhat.com ([209.132.183.28]:39221 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753732AbbKETt2 (ORCPT ); Thu, 5 Nov 2015 14:49:28 -0500 From: Jeff Moyer To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox , axboe@kernel.dk Subject: Re: [RFC 00/11] DAX fsynx/msync support References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> <20151105083309.GJ19199@dastard> X-PGP-KeyID: 1F78E1B4 X-PGP-CertKey: F6FE 280D 8293 F72C 65FD 5A58 1FF8 A7CA 1F78 E1B4 X-PCLoadLetter: What the f**k does that mean? Date: Thu, 05 Nov 2015 14:49:21 -0500 In-Reply-To: <20151105083309.GJ19199@dastard> (Dave Chinner's message of "Thu, 5 Nov 2015 19:33:09 +1100") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Dave Chinner writes: >> But this part is not. It is up to the I/O scheduler to decide when to >> dispatch requests. It can hold on to them for a variety of reasons. >> Flush requests, however, do not go through the I/O scheduler. At the > > That's pure REQ_FLUSH bios, right? Aren't data IOs with > REQ_FLUSH|REQ_FUA sorted like any other IO? No, they also go through the flush machinery, and so short-circuit the I/O scheduler. >> Des xfs rely on this model for correctness? If so, I'd say we've got a >> problem > > No, it doesn't. The XFS integrity model doesn't trust the IO layers > to tell the truth about IO ordering and completion or for it's > developers to fully understand how IO layer ordering works. :P > > i.e. we wait for full completions of all dependent IO before issuing > flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure > the dependent IOs are fully caught by the cache flushes... OK, phew! ;-) -Jeff From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755783AbbKEUyY (ORCPT ); Thu, 5 Nov 2015 15:54:24 -0500 Received: from mail-io0-f180.google.com ([209.85.223.180]:35978 "EHLO mail-io0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752100AbbKEUyV (ORCPT ); Thu, 5 Nov 2015 15:54:21 -0500 Subject: Re: [RFC 00/11] DAX fsynx/msync support To: Dave Chinner , Jeff Moyer References: <1446149535-16200-1-git-send-email-ross.zwisler@linux.intel.com> <20151030035533.GU19199@dastard> <20151030183938.GC24643@linux.intel.com> <20151101232948.GF10656@dastard> <20151102201029.GI10656@dastard> <20151105083309.GJ19199@dastard> Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "H. Peter Anvin" , "J. Bruce Fields" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Dan Williams , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org, xfs@oss.sgi.com, Andrew Morton , Matthew Wilcox From: Jens Axboe Message-ID: <563BC1FB.60004@kernel.dk> Date: Thu, 5 Nov 2015 13:54:19 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 In-Reply-To: <20151105083309.GJ19199@dastard> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/05/2015 01:33 AM, Dave Chinner wrote: >> Des xfs rely on this model for correctness? If so, I'd say we've got a >> problem > > No, it doesn't. The XFS integrity model doesn't trust the IO layers > to tell the truth about IO ordering and completion or for it's > developers to fully understand how IO layer ordering works. :P That's good, because the storage developers simplified the model so that fs developers would be able to get and use it. > i.e. we wait for full completions of all dependent IO before issuing > flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure > the dependent IOs are fully caught by the cache flushes... ... which is what you are supposed to do, that's how it works. -- Jens Axboe