From mboxrd@z Thu Jan 1 00:00:00 1970
From: tytso@mit.edu
Subject: Re: [PATCH v7 19/22] ext4: Make ext4_block_zero_page_range static
Date: Mon, 24 Mar 2014 15:11:59 -0400
Message-ID: <20140324191158.GC6896@thunk.org>
References: <6ae0bcd05c2e114d3c4a7803415b6c2c8a8dadd7.1395591795.git.matthew.r.wilcox@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com, linux-ext4@vger.kernel.org
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <6ae0bcd05c2e114d3c4a7803415b6c2c8a8dadd7.1395591795.git.matthew.r.wilcox@intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-ext4.vger.kernel.org

On Sun, Mar 23, 2014 at 03:08:45PM -0400, Matthew Wilcox wrote:
> It's only called within inode.c, so make it static, remove its prototype
> from ext4.h and move it above all of its callers so it doesn't need a
> prototype within inode.c.
>
> Signed-off-by: Matthew Wilcox

Thanks, applied to the ext4 tree.

- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: tytso@mit.edu
Subject: Re: [PATCH v7 21/22] ext4: Fix typos
Date: Mon, 24 Mar 2014 15:16:14 -0400
Message-ID: <20140324191614.GD6896@thunk.org>
References: <2b2c5467283817503fede11d12cba8aef912c9c5.1395591795.git.matthew.r.wilcox@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com, linux-ext4@vger.kernel.org
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <2b2c5467283817503fede11d12cba8aef912c9c5.1395591795.git.matthew.r.wilcox@intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-ext4.vger.kernel.org

On Sun, Mar 23, 2014 at 03:08:47PM -0400, Matthew Wilcox wrote:
> Comment fix only
>
> Signed-off-by: Matthew Wilcox

Thanks, applied to the ext4 git tree.

- Ted

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 00/22] Support ext4 on NV-DIMMs
Date: Sun, 23 Mar 2014 15:08:26 -0400
Message-ID:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM. While that works,
it currently wastes memory and CPU time buffering the files in the page
cache. We have support in ext2 for bypassing the page cache, but it
has some races which are unfixable in the current design. This series
of patches rewrites the underlying support and adds support for direct
access to ext4.

This iteration of the patchset rebases to Linus' 3.14-rc7 (plus Kirill's
patches in linux-next http://marc.info/?l=linux-mm&m=139206489208546&w=2)
and fixes several bugs:

 - Initialise cow_page in do_page_mkwrite() (Matthew Wilcox)
 - Clear new or unwritten blocks in page fault handler (Matthew Wilcox)
 - Only call get_block when necessary (Matthew Wilcox)
 - Reword Kconfig options (Matthew Wilcox / Vishal Verma)
 - Fix a race between page fault and truncate (Matthew Wilcox)
 - Fix a race between fault-for-read and fault-for-write (Matthew Wilcox)
 - Zero the correct bytes in dax_new_buf() (Toshi Kani)
 - Add DIO_LOCKING to an invocation of dax_do_io in ext4 (Ross Zwisler)

Relative to the last patchset, I folded the 'Add reporting of major faults'
patch into the patch that adds the DAX page fault handler.

The v6 patchset had seven additional xfstests failures. This patchset
now passes approximately as many xfstests as ext4 does on a ramdisk.

Matthew Wilcox (21):
  Fix XIP fault vs truncate race
  Allow page fault handlers to perform the COW
  axonram: Fix bug in direct_access
  Change direct_access calling convention
  Introduce IS_DAX(inode)
  Replace XIP read and write with DAX I/O
  Replace the XIP page fault handler with the DAX page fault handler
  Replace xip_truncate_page with dax_truncate_page
  Remove mm/filemap_xip.c
  Remove get_xip_mem
  Replace ext2_clear_xip_target with dax_clear_blocks
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  ext2: Remove ext2_aops_xip
  Get rid of most mentions of XIP in ext2
  xip: Add xip_zero_page_range
  ext4: Make ext4_block_zero_page_range static
  ext4: Fix typos
  brd: Rename XIP to DAX

Ross Zwisler (1):
  ext4: Add DAX functionality

 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/dax.txt  |  84 ++++++
 Documentation/filesystems/ext4.txt |   2 +
 Documentation/filesystems/xip.txt  |  68 -----
 arch/powerpc/sysdev/axonram.c      |   8 +-
 drivers/block/Kconfig              |  13 +-
 drivers/block/brd.c                |  22 +-
 drivers/s390/block/dcssblk.c       |  19 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/dax.c                           | 509 +++++++++++++++++++++++++++++++++++++
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |   9 +-
 fs/ext2/file.c                     |  45 +++-
 fs/ext2/inode.c                    |  37 +--
 fs/ext2/namei.c                    |  13 +-
 fs/ext2/super.c                    |  48 ++--
 fs/ext2/xip.c                      |  91 -------
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   8 +-
 fs/ext4/file.c                     |  53 +++-
 fs/ext4/indirect.c                 |  19 +-
 fs/ext4/inode.c                    |  94 ++++---
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 include/linux/blkdev.h             |   4 +-
 include/linux/fs.h                 |  49 +++-
 include/linux/mm.h                 |   2 +
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |   6 +-
 mm/filemap_xip.c                   | 483 -----------------------------------
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  45 +++-
 37 files changed, 984 insertions(+), 874 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt
 create mode 100644 fs/dax.c
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 delete mode 100644 mm/filemap_xip.c

--
1.9.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 02/22] Allow page fault handlers to perform the COW
Date: Sun, 23 Mar 2014 15:08:28 -0400
Message-ID:
References:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page.
It is much more efficient to tell the fault handler that a COW is being
attempted (by passing in the pre-allocated page in the vm_fault structure),
and allow the handler to perform the COW operation itself.

Where the filemap code protects against truncation of the file until
the PTE has been installed with the page lock, the XIP code uses the
i_mmap_mutex instead. We must therefore unlock the i_mmap_mutex after
inserting the PTE.

Signed-off-by: Matthew Wilcox
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 45 +++++++++++++++++++++++++++++++++------------
 2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1b7414..513b78a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -205,6 +205,7 @@ struct vm_fault {
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
+	struct page *cow_page;		/* Handler may choose to COW */
 	struct page *page;		/* ->fault handlers should return a
 					 * page here, unless VM_FAULT_NOPAGE
 					 * is set (which is also implied by
@@ -1010,6 +1011,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020	/* Hit poisoned large page. Index encoded in upper bits */
 
+#define VM_FAULT_COWED	0x0080	/* ->fault COWed the page instead */
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
diff --git a/mm/memory.c b/mm/memory.c
index 07b4287..2a2ecac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2602,6 +2602,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	vmf.pgoff = page->index;
 	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 	vmf.page = page;
+	vmf.cow_page = NULL;
 
 	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
@@ -3288,7 +3289,8 @@ oom:
 }
 
 static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-		pgoff_t pgoff, unsigned int flags, struct page **page)
+		pgoff_t pgoff, unsigned int flags,
+		struct page *cow_page, struct page **page)
 {
 	struct vm_fault vmf;
 	int ret;
@@ -3297,10 +3299,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.cow_page = cow_page;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
+	if (unlikely(ret & VM_FAULT_COWED))
+		goto out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -3314,6 +3319,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
 
+ out:
 	*page = vmf.page;
 	return ret;
 }
@@ -3351,7 +3357,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 	int ret;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -3368,6 +3374,12 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * If the fault handler performs the COW, it does not return a page,
+ * so cannot use the page's lock to protect against a concurrent truncate
+ * operation. Instead it returns with the i_mmap_mutex held, which must
+ * be released after the PTE has been inserted.
+ */
 static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3389,25 +3401,34 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
-	copy_user_highpage(new_page, fault_page, address, vma);
+	if (!(ret & VM_FAULT_COWED))
+		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	if (unlikely(!pte_same(*pte, orig_pte)))
+		goto unlock_out;
+	do_set_pte(vma, address, new_page, pte, true, true);
+	pte_unmap_unlock(pte, ptl);
+	if (ret & VM_FAULT_COWED) {
+		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	} else {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
-		goto uncharge_out;
 	}
-	do_set_pte(vma, address, new_page, pte, true, true);
-	pte_unmap_unlock(pte, ptl);
-	unlock_page(fault_page);
-	page_cache_release(fault_page);
 	return ret;
+unlock_out:
+	pte_unmap_unlock(pte, ptl);
+	if (ret & VM_FAULT_COWED) {
+		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	} else {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+	}
 uncharge_out:
 	mem_cgroup_uncharge_page(new_page);
 	page_cache_release(new_page);
@@ -3424,7 +3445,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
--
1.9.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 05/22] Introduce IS_DAX(inode)
Date: Sun, 23 Mar 2014 15:08:31 -0400
Message-ID: <6a8918c9a0fb37882179e3699b3e04d96540b24f.1395591795.git.matthew.r.wilcox@intel.com>
References:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().
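
As a rough illustration of the per-inode flag test described above, here
is a minimal user-space sketch in C. It is not kernel code: struct
inode_model, set_dax_flag() and uses_page_cache() are hypothetical names
made up for this example. Only the S_NOSEC and S_DAX values and the shape
of the IS_DAX() macro are taken from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in include/linux/fs.h by this patch. */
#define S_NOSEC 4096u
#define S_DAX   8192u

/* Hypothetical stand-in for struct inode; only i_flags is modelled. */
struct inode_model {
	unsigned int i_flags;
};

/* Same shape as the macro the patch adds when CONFIG_FS_XIP is set. */
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)

/*
 * Hypothetical helper modelled on the S_DAX handling in
 * ext2_set_inode_flags(): clear the flag, then set it again if the
 * filesystem was mounted with the XIP option.
 */
static void set_dax_flag(struct inode_model *inode, int mounted_with_xip)
{
	assert(inode != NULL);
	inode->i_flags &= ~S_DAX;
	if (mounted_with_xip)
		inode->i_flags |= S_DAX;
}

/* Hypothetical caller: I/O goes through the page cache unless IS_DAX. */
static int uses_page_cache(const struct inode_model *inode)
{
	assert(inode != NULL);
	return !IS_DAX(inode);
}
```

The point of tagging the inode rather than testing the address_space
operations is that callers only need the inode in hand to decide which
I/O path to take.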

Signed-off-by: Matthew Wilcox
---
 fs/ext2/inode.c    | 9 ++++++---
 fs/ext2/xip.h      | 2 --
 include/linux/fs.h | 6 ++++++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 94ed3684..e7d3192 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
 		goto cleanup;
 	}
 
-	if (ext2_use_xip(inode->i_sb)) {
+	if (IS_DAX(inode)) {
 		/*
 		 * we need to clear the block
 		 */
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 
 	inode_dio_wait(inode);
 
-	if (mapping_is_xip(inode->i_mapping))
+	if (IS_DAX(inode))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_DAX;
 }
 
 /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 }
 int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 				void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
 #else
-#define mapping_is_xip(map)			0
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 23b2a35..47fd219 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1644,6 +1644,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
+#define S_DAX		8192	/* Direct Access, avoiding the page cache */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1681,6 +1682,11 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode)		0
+#endif
 
 /*
  * Inode state bits. Protected by inode->i_lock
--
1.9.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 04/22] Change direct_access calling convention
Date: Sun, 23 Mar 2014 15:08:30 -0400
Message-ID: <214af2a38d840d0b8e983d39d03711d1292bc2d6.1395591795.git.matthew.r.wilcox@intel.com>
References:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.
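
The calling convention described above can be sketched in user space as
follows. This is only a model under stated assumptions: struct ram_dev
and ram_direct_access() are hypothetical names, the pfn output is left
out, and the bounds check (offset at or past the end is rejected) is this
sketch's own choice rather than a copy of any driver in the patch. What
it does take from the patch is the contract: return the number of bytes
available at the sector, never more than 'size', or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SECTOR_SIZE 512L

/* Hypothetical RAM-backed device: a flat buffer and its length in bytes. */
struct ram_dev {
	char *base;
	long size;
};

/*
 * Model of the new direct_access contract: on success, store the address
 * of the requested sector in *kaddr and return how many bytes are usable
 * there, capped at 'size'. On failure, return a negative errno.
 */
static long ram_direct_access(struct ram_dev *dev, long sector,
			      void **kaddr, long size)
{
	long offset, avail;

	assert(dev != NULL && kaddr != NULL);
	if (size < 0 || sector < 0)
		return -EINVAL;
	offset = sector * SECTOR_SIZE;
	if (offset >= dev->size)
		return -ERANGE;

	*kaddr = dev->base + offset;
	avail = dev->size - offset;
	return size < avail ? size : avail;
}
```

A caller that wants a whole range would loop, advancing by the returned
byte count each time, which is what makes the 'amount available' return
value more useful than the old 0-or-error result.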

Signed-off-by: Matthew Wilcox
---
 Documentation/filesystems/xip.txt | 15 +++++++++------
 arch/powerpc/sysdev/axonram.c     |  6 +++---
 drivers/block/brd.c               |  8 +++++---
 drivers/s390/block/dcssblk.c      | 19 ++++++++++---------
 fs/ext2/xip.c                     | 30 +++++++++++++-----------------
 include/linux/blkdev.h            |  4 ++--
 6 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b62eabf 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
 Execute-in-place is implemented in three steps: block device operation,
 address space operation, and file operations.
 
-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory. It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested. It may also return a negative errno if an error occurs.
 
 The block device operation is optional, these block devices support it as
 of today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..1697e29 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,9 +139,9 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
-static int
+static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn)
+		       void **kaddr, unsigned long *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset;
@@ -158,7 +158,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	*kaddr = (void *)(bank->ph_addr + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return min_t(unsigned long, size, bank->size - offset);
 }
 
 static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index e73b85c..00da60d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -361,8 +361,8 @@ out:
 }
 
 #ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	return 0;
+	/* Could optimistically check to see if the next page in the
+	 * file is mapped to the next page of physical RAM */
+	return PAGE_SIZE;
 }
 #endif
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index ebf41e2..da914b2 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-				 void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+				 void **kaddr, unsigned long *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -866,25 +866,26 @@ fail:
 	bio_io_error(bio);
 }
 
-static int
+static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void **kaddr, unsigned long *pfn)
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct dcssblk_dev_info *dev_info;
-	unsigned long pgoff;
+	unsigned long offset, dev_sz;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
+	dev_sz = dev_info->end - dev_info->start;
 	if (secnum % (PAGE_SIZE/512))
 		return -EINVAL;
-	pgoff = secnum / (PAGE_SIZE / 512);
-	if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
+	offset = secnum * 512;
+	if (offset > dev_sz)
 		return -ERANGE;
-	*kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+	*kaddr = (void *) (dev_info->start + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return min_t(unsigned long, size, dev_sz - offset);
 }
 
 static void
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..fa40091 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,13 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
-		      void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+				void **kaddr, unsigned long *pfn, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector;
-
-	sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
-	BUG_ON(!ops->direct_access);
-	return ops->direct_access(bdev, sector, kaddr, pfn);
+	sector_t sector = block * (PAGE_SIZE / 512);
+	return ops->direct_access(bdev, sector, kaddr, pfn, size);
 }
 
 static inline int
@@ -53,12 +48,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
 	void *kaddr;
 	unsigned long pfn;
-	int rc;
+	long size;
 
-	rc = __inode_direct_access(inode, block, &kaddr, &pfn);
-	if (!rc)
-		clear_page(kaddr);
-	return rc;
+	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+	if (size < 0)
+		return size;
+	clear_page(kaddr);
+	return 0;
 }
 
 void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +73,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
 int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 				void **kmem, unsigned long *pfn)
 {
-	int rc;
+	long rc;
 	sector_t block;
 
 	/* first, retrieve the sector number */
@@ -86,6 +82,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 		return rc;
 
 	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn);
-	return rc;
+	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+	return (rc < 0) ? rc : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..c6f6210 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1560,8 +1560,8 @@ struct block_device_operations {
 	void (*release) (struct gendisk *, fmode_t);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	int (*direct_access) (struct block_device *, sector_t,
-						void **, unsigned long *);
+	long (*direct_access) (struct block_device *, sector_t,
+					void **, unsigned long *pfn, long size);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
--
1.9.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Date: Sun, 23 Mar 2014 15:08:33 -0400
Message-ID:
References:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.
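
To make the get_block-based lookup a little more concrete, here is a
user-space sketch of the arithmetic the fault path in this patch relies
on: round i_size up to whole pages for the bounds check, convert the
faulting page offset to a filesystem block number, and hand that block to
a caller-supplied mapping callback. Everything named here is hypothetical
(get_block_model_t, linear_get_block(), lookup_page()), the real
get_block_t takes an inode and a buffer_head instead, and a 4096-byte
page is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical, simplified stand-in for the kernel's get_block_t. */
typedef int (*get_block_model_t)(unsigned long long file_block,
				 unsigned long long *disk_block);

/* File size rounded up to whole pages, as in the i_size check. */
static unsigned long size_in_pages(unsigned long long i_size)
{
	return (unsigned long)((i_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
}

/* First filesystem block backing page 'pgoff' for a given block size. */
static unsigned long long first_block_of_page(unsigned long pgoff,
					      unsigned int blkbits)
{
	assert(blkbits <= PAGE_SHIFT);
	return (unsigned long long)pgoff << (PAGE_SHIFT - blkbits);
}

/* Example callback: file block N lives at disk block N + 100. */
static int linear_get_block(unsigned long long file_block,
			    unsigned long long *disk_block)
{
	*disk_block = file_block + 100;
	return 0;
}

/* Returns 0 on success, -1 where the real handler would raise SIGBUS. */
static int lookup_page(unsigned long pgoff, unsigned long long i_size,
		       unsigned int blkbits, get_block_model_t get_block,
		       unsigned long long *disk_block)
{
	assert(get_block != NULL && disk_block != NULL);
	if (pgoff >= size_in_pages(i_size))
		return -1;
	return get_block(first_block_of_page(pgoff, blkbits), disk_block);
}
```

Passing the mapping function in as a parameter is what lets one generic
fault handler serve several filesystems: each filesystem supplies its own
block lookup and the page-offset arithmetic stays shared.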

Signed-off-by: Matthew Wilcox
---
 fs/dax.c           | 207 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/file.c     |  35 ++++++++-
 include/linux/fs.h |   4 +-
 mm/filemap_xip.c   | 206 ----------------------------------------------------
 4 files changed, 243 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 66a6bda..863749c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,8 +19,12 @@
 #include
 #include
 #include
+#include
+#include
+#include
 #include
 #include
+#include
 
 static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
 								void **addr)
@@ -32,6 +36,16 @@ static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
 	return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
 }
 
+static long dax_get_pfn(struct inode *inode, struct buffer_head *bh,
+							unsigned long *pfn)
+{
+	struct block_device *bdev = bh->b_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	void *addr;
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+	return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size);
+}
+
 static void dax_new_buf(void *addr, unsigned size, unsigned first,
 					loff_t offset, loff_t end, int rw)
 {
@@ -214,3 +228,196 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	return retval;
 }
 EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file. Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files. We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct page *page,
+							struct vm_fault *vmf)
+{
+	unsigned long size;
+	struct inode *inode = mapping->host;
+	if (!page)
+		page = find_or_create_page(mapping, vmf->pgoff,
+						GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static void copy_user_bh(struct page *to, struct inode *inode,
+				struct buffer_head *bh, unsigned long vaddr)
+{
+	void *vfrom, *vto;
+	dax_get_addr(inode, bh, &vfrom);	/* XXX: error handling */
+	vto = kmap_atomic(to);
+	copy_user_page(vto, vfrom, vaddr, to);
+	kunmap_atomic(vto);
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+							get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = file->f_mapping;
+	struct page *page;
+	struct buffer_head bh;
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	sector_t block;
+	pgoff_t size;
+	unsigned long pfn;
+	int error;
+	int major = 0;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
+	bh.b_size = PAGE_SIZE;
+
+ repeat:
+	page = find_get_page(mapping, vmf->pgoff);
+	if (page) {
+		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+			page_cache_release(page);
+			return VM_FAULT_RETRY;
+		}
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+	}
+
+	error = get_block(inode, block, &bh, 0);
+	if (error || bh.b_size < PAGE_SIZE)
+		goto sigbus;
+
+	if (!buffer_written(&bh) && !vmf->cow_page) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (error || bh.b_size < PAGE_SIZE)
+				goto sigbus;
+		} else {
+			return dax_load_hole(mapping, page, vmf);
+		}
+	}
+
+	/* Recheck i_size under i_mmap_mutex */
+	mutex_lock(&mapping->i_mmap_mutex);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (unlikely(vmf->pgoff >= size)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		goto sigbus;
+	}
+	if (vmf->cow_page) {
+		if (buffer_written(&bh))
+			copy_user_bh(vmf->cow_page, inode, &bh, vaddr);
+		else
+			clear_user_highpage(vmf->cow_page, vaddr);
+		if (page) {
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		/* do_cow_fault() will release the i_mmap_mutex */
+		return VM_FAULT_COWED;
+	}
+
+	if (buffer_unwritten(&bh) || buffer_new(&bh))
+		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
+
+	error = dax_get_pfn(inode, &bh, &pfn);
+	if (error > 0)
+		error = vm_insert_mixed(vma, vaddr, pfn);
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (page) {
+		delete_from_page_cache(page);
+		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+							PAGE_CACHE_SIZE, 0);
+		unlock_page(page);
+		page_cache_release(page);
+	}
+
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	/* -EBUSY is fine, somebody else faulted on the same PTE */
+	if (error != -EBUSY)
+		BUG_ON(error);
+	return VM_FAULT_NOPAGE | major;
+
+ sigbus:
+	if (page) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	return VM_FAULT_SIGBUS;
+}
+
+/**
+ * dax_fault - handle a page fault on an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for XIP files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	result = do_dax_fault(vma, vmf, get_block);
+	sb_end_pagefault(sb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
+ * dax_mkwrite - convert a read-only page to read-write in an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * XIP handles reads of holes by adding pages full of zeroes into the
+ * mapping. If the page is subsequently written to, we have to allocate
+ * the page on media and free the page that was in the cache.
+ */
+int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	result = do_dax_fault(vma, vmf, get_block);
+	sb_end_pagefault(sb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(dax_mkwrite);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ef5cf96..e3ce10d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
 #include "xattr.h"
 #include "acl.h"
 
+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+	.fault		= ext2_dax_fault,
+	.page_mkwrite	= ext2_dax_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (!IS_DAX(file_inode(file)))
+		return generic_file_mmap(file, vma);
+
+	file_accessed(file);
+	vma->vm_ops = &ext2_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP;
+	return 0;
+}
+#else
+#define ext2_file_mmap	generic_file_mmap
+#endif
+
 /*
  * Called when filp is released. This happens when all file descriptors
  * for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= xip_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dabc601..1607812 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -48,6 +48,7 @@ struct cred;
 struct swap_info_struct;
 struct seq_file;
 struct workqueue_struct;
+struct vm_fault;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2521,10 +2522,11 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212
+22,6 @@ #include /* - * We do use our own empty page to avoid interference with other users - * of ZERO_PAGE(), such as /dev/zero - */ -static DEFINE_MUTEX(xip_sparse_mutex); -static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq); -static struct page *__xip_sparse_page; - -/* called under xip_sparse_mutex */ -static struct page *xip_sparse_page(void) -{ - if (!__xip_sparse_page) { - struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO); - - if (page) - __xip_sparse_page = page; - } - return __xip_sparse_page; -} - -/* - * __xip_unmap is invoked from xip_unmap and - * xip_write - * - * This function walks all vmas of the address_space and unmaps the - * __xip_sparse_page when found at pgoff. - */ -static void -__xip_unmap (struct address_space * mapping, - unsigned long pgoff) -{ - struct vm_area_struct *vma; - struct mm_struct *mm; - unsigned long address; - pte_t *pte; - pte_t pteval; - spinlock_t *ptl; - struct page *page; - unsigned count; - int locked = 0; - - count = read_seqcount_begin(&xip_sparse_seq); - - page = __xip_sparse_page; - if (!page) - return; - -retry: - mutex_lock(&mapping->i_mmap_mutex); - vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { - mm = vma->vm_mm; - address = vma->vm_start + - ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); - BUG_ON(address < vma->vm_start || address >= vma->vm_end); - pte = page_check_address(page, mm, address, &ptl, 1); - if (pte) { - /* Nuke the page table entry. 
*/ - flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); - page_remove_rmap(page); - dec_mm_counter(mm, MM_FILEPAGES); - BUG_ON(pte_dirty(pteval)); - pte_unmap_unlock(pte, ptl); - /* must invalidate_page _before_ freeing the page */ - mmu_notifier_invalidate_page(mm, address); - page_cache_release(page); - } - } - mutex_unlock(&mapping->i_mmap_mutex); - - if (locked) { - mutex_unlock(&xip_sparse_mutex); - } else if (read_seqcount_retry(&xip_sparse_seq, count)) { - mutex_lock(&xip_sparse_mutex); - locked = 1; - goto retry; - } -} - -/* - * xip_fault() is invoked via the vma operations vector for a - * mapped memory region to read in file data during a page fault. - * - * This function is derived from filemap_fault, but used for execute in place - */ -static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf) -{ - struct file *file = vma->vm_file; - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; - pgoff_t size; - void *xip_mem; - unsigned long xip_pfn; - struct page *page; - int error; - - /* XXX: are VM_FAULT_ codes OK? 
*/ -again: - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; - if (vmf->pgoff >= size) - return VM_FAULT_SIGBUS; - - error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0, - &xip_mem, &xip_pfn); - if (likely(!error)) - goto found; - if (error != -ENODATA) - return VM_FAULT_OOM; - - /* sparse block */ - if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) && - (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) && - (!(mapping->host->i_sb->s_flags & MS_RDONLY))) { - int err; - - /* maybe shared writable, allocate new block */ - mutex_lock(&xip_sparse_mutex); - error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1, - &xip_mem, &xip_pfn); - mutex_unlock(&xip_sparse_mutex); - if (error) - return VM_FAULT_SIGBUS; - /* unmap sparse mappings at pgoff from all other vmas */ - __xip_unmap(mapping, vmf->pgoff); - -found: - /* We must recheck i_size under i_mmap_mutex */ - mutex_lock(&mapping->i_mmap_mutex); - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> - PAGE_CACHE_SHIFT; - if (unlikely(vmf->pgoff >= size)) { - mutex_unlock(&mapping->i_mmap_mutex); - return VM_FAULT_SIGBUS; - } - err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, - xip_pfn); - mutex_unlock(&mapping->i_mmap_mutex); - if (err == -ENOMEM) - return VM_FAULT_OOM; - /* - * err == -EBUSY is fine, we've raced against another thread - * that faulted-in the same page - */ - if (err != -EBUSY) - BUG_ON(err); - return VM_FAULT_NOPAGE; - } else { - int err, ret = VM_FAULT_OOM; - - mutex_lock(&xip_sparse_mutex); - write_seqcount_begin(&xip_sparse_seq); - error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0, - &xip_mem, &xip_pfn); - if (unlikely(!error)) { - write_seqcount_end(&xip_sparse_seq); - mutex_unlock(&xip_sparse_mutex); - goto again; - } - if (error != -ENODATA) - goto out; - - /* We must recheck i_size under i_mmap_mutex */ - mutex_lock(&mapping->i_mmap_mutex); - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> - PAGE_CACHE_SHIFT; - if (unlikely(vmf->pgoff >= 
size)) { - ret = VM_FAULT_SIGBUS; - goto unlock; - } - /* not shared and writable, use xip_sparse_page() */ - page = xip_sparse_page(); - if (!page) - goto unlock; - err = vm_insert_page(vma, (unsigned long)vmf->virtual_address, - page); - if (err == -ENOMEM) - goto unlock; - - ret = VM_FAULT_NOPAGE; -unlock: - mutex_unlock(&mapping->i_mmap_mutex); -out: - write_seqcount_end(&xip_sparse_seq); - mutex_unlock(&xip_sparse_mutex); - - return ret; - } -} - -static const struct vm_operations_struct xip_file_vm_ops = { - .fault = xip_file_fault, - .page_mkwrite = filemap_page_mkwrite, - .remap_pages = generic_file_remap_pages, -}; - -int xip_file_mmap(struct file * file, struct vm_area_struct * vma) -{ - BUG_ON(!file->f_mapping->a_ops->get_xip_mem); - - file_accessed(file); - vma->vm_ops = &xip_file_vm_ops; - vma->vm_flags |= VM_MIXEDMAP; - return 0; -} -EXPORT_SYMBOL_GPL(xip_file_mmap); - -/* * truncate a page used for execute in place * functionality is analog to block_truncate_page but does use get_xip_mem * to get the page instead of page cache -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Date: Sun, 23 Mar 2014 15:08:37 -0400 Message-ID: References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org This is practically generic code; other filesystems will want to call it from other places, but there's nothing ext2-specific about it. Make it a little more generic by allowing it to take a count of the number of bytes to zero rather than fixing it to a single page.
Thanks to Dave Hansen for suggesting that I need to call cond_resched() if zeroing more than one page. Signed-off-by: Matthew Wilcox --- fs/dax.c | 34 ++++++++++++++++++++++++++++++++++ fs/ext2/inode.c | 8 +++++--- fs/ext2/xip.c | 23 ----------------------- fs/ext2/xip.h | 3 --- include/linux/fs.h | 6 ++++++ 5 files changed, 45 insertions(+), 29 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 7271be0..45a0a41 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -23,9 +23,43 @@ #include #include #include +#include #include #include +int dax_clear_blocks(struct inode *inode, sector_t block, long size) +{ + struct block_device *bdev = inode->i_sb->s_bdev; + const struct block_device_operations *ops = bdev->bd_disk->fops; + sector_t sector = block << (inode->i_blkbits - 9); + unsigned long pfn; + + might_sleep(); + do { + void *addr; + long count = ops->direct_access(bdev, sector, &addr, &pfn, + size); + if (count < 0) + return count; + while (count >= PAGE_SIZE) { + clear_page(addr); + addr += PAGE_SIZE; + size -= PAGE_SIZE; + count -= PAGE_SIZE; + sector += PAGE_SIZE / 512; + cond_resched(); + } + if (count > 0) { + memset(addr, 0, count); + sector += count / 512; + size -= count; + } + } while (size); + + return 0; +} +EXPORT_SYMBOL_GPL(dax_clear_blocks); + static long dax_get_addr(struct inode *inode, struct buffer_head *bh, void **addr) { diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index b156fe8..a9346a9 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode, if (IS_DAX(inode)) { /* - * we need to clear the block + * block must be initialised before we put it in the tree + * so that it's not found by another thread before it's + * initialised */ - err = ext2_clear_xip_target (inode, - le32_to_cpu(chain[depth-1].key)); + err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key), + count << inode->i_blkbits); if (err) { mutex_unlock(&ei->truncate_mutex); goto cleanup; diff --git a/fs/ext2/xip.c 
b/fs/ext2/xip.c index ca745ff..132d4da 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -13,29 +13,6 @@ #include "ext2.h" #include "xip.h" -static inline long __inode_direct_access(struct inode *inode, sector_t block, - void **kaddr, unsigned long *pfn, long size) -{ - struct block_device *bdev = inode->i_sb->s_bdev; - const struct block_device_operations *ops = bdev->bd_disk->fops; - sector_t sector = block * (PAGE_SIZE / 512); - return ops->direct_access(bdev, sector, kaddr, pfn, size); -} - -int -ext2_clear_xip_target(struct inode *inode, sector_t block) -{ - void *kaddr; - unsigned long pfn; - long size; - - size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE); - if (size < 0) - return size; - clear_page(kaddr); - return 0; -} - void ext2_xip_verify_sb(struct super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index 0fa8b7f..e7b9f0a 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -7,8 +7,6 @@ #ifdef CONFIG_EXT2_FS_XIP extern void ext2_xip_verify_sb (struct super_block *); -extern int ext2_clear_xip_target (struct inode *, sector_t); - static inline int ext2_use_xip (struct super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); @@ -17,5 +15,4 @@ static inline int ext2_use_xip (struct super_block *sb) #else #define ext2_xip_verify_sb(sb) do { } while (0) #define ext2_use_xip(sb) 0 -#define ext2_clear_xip_target(inode, chain) 0 #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index c777056..aeab3fda 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2520,12 +2520,18 @@ extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_XIP +int dax_clear_blocks(struct inode *, sector_t block, long size); int dax_truncate_page(struct inode *, loff_t from, get_block_t); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, loff_t, unsigned segs, get_block_t, 
dio_iodone_t, int flags); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t); #else +static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz) +{ + return 0; +} + static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb) { return 0; -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 10/22] Remove get_xip_mem Date: Sun, 23 Mar 2014 15:08:36 -0400 Message-ID: References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org All callers of get_xip_mem() are now gone. Remove checks for it, initialisers of it, documentation of it and the only implementation of it. Add documentation for writing a filesystem that supports DAX.
Signed-off-by: Matthew Wilcox Reviewed-by: Randy Dunlap --- Documentation/filesystems/Locking | 3 -- Documentation/filesystems/dax.txt | 82 +++++++++++++++++++++++++++++++++++++++ Documentation/filesystems/xip.txt | 71 --------------------------------- fs/exofs/inode.c | 1 - fs/ext2/inode.c | 1 - fs/ext2/xip.c | 37 ------------------ fs/ext2/xip.h | 3 -- fs/open.c | 5 +-- include/linux/fs.h | 2 - mm/fadvise.c | 6 ++- mm/madvise.c | 2 +- 11 files changed, 88 insertions(+), 125 deletions(-) create mode 100644 Documentation/filesystems/dax.txt delete mode 100644 Documentation/filesystems/xip.txt diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 5b0c083..2780d47 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -194,8 +194,6 @@ prototypes: void (*freepage)(struct page *); int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); - int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, - unsigned long *); int (*migratepage)(struct address_space *, struct page *, struct page *); int (*launder_page)(struct page *); int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long); @@ -220,7 +218,6 @@ invalidatepage: yes releasepage: yes freepage: yes direct_IO: -get_xip_mem: maybe migratepage: yes (both) launder_page: yes is_partially_uptodate: yes diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt new file mode 100644 index 0000000..06f84e5 --- /dev/null +++ b/Documentation/filesystems/dax.txt @@ -0,0 +1,82 @@ +Execute-in-place for file mappings +---------------------------------- + +Motivation +---------- + +File mappings are usually performed by mapping page cache pages to +userspace. In addition, read & write file operations also transfer data +between the page cache and storage. 
+ +For memory backed storage devices that use the block device interface, +the page cache pages are just copies of the original storage. The +execute-in-place code removes the extra copy by performing reads and +writes directly on the memory backed storage device. For file mappings, +the storage device itself is mapped directly into userspace. + + +Implementation Tips for Block Driver Writers +-------------------------------------------- + +To support DAX in your block driver, implement the 'direct_access' +block device operation. It is used to translate the sector number +(expressed in units of 512-byte sectors) to a page frame number (pfn) +that identifies the physical page for the memory. It also returns a +kernel virtual address that can be used to access the memory. + +The direct_access method takes a 'size' parameter that indicates the +number of bytes being requested. The function should return the number +of bytes that it can provide, although it must not exceed the number of +bytes requested. It may also return a negative errno if an error occurs. + +In order to support this method, the storage must be byte-accessible by +the CPU at all times. If your device uses paging techniques to expose +a large amount of memory through a smaller window, then you cannot +implement direct_access. Equally, if your device can occasionally +stall the CPU for an extended period, you should also not attempt to +implement direct_access. 
+ +These block devices may be used for inspiration: +- axonram: Axon DDR2 device driver +- brd: RAM backed block device driver +- dcssblk: s390 dcss block device driver + + +Implementation Tips for Filesystem Writers +------------------------------------------ + +Filesystem support consists of +- adding support to mark inodes as being DAX by setting the S_DAX flag in + i_flags +- implementing the direct_IO address space operation, and calling + dax_do_io() instead of blockdev_direct_IO() if S_DAX is set +- implementing an mmap file operation for DAX files which sets the + VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers + for fault and page_mkwrite (which should probably call dax_fault() and + dax_mkwrite(), passing the appropriate get_block() callback) +- calling dax_truncate_page() instead of block_truncate_page() for DAX files +- ensuring that there is sufficient locking between reads, writes, + truncates and page faults + +The get_block() callback passed to the DAX functions may return +uninitialised extents. If it does, it must ensure that simultaneous +calls to get_block() (for example by a page-fault racing with a read() +or a write()) work correctly. + +These filesystems may be used for inspiration: +- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt + + +Shortcomings +------------ + +Even if the kernel or its modules are stored on a filesystem that supports +DAX on a block device that supports DAX, they will still be copied into RAM. + +Calling get_user_pages() on a range of user memory that has been mmaped +from a DAX file will fail as there are no 'struct page' to describe +those pages. This problem is being worked on. That means that O_DIRECT +reads/writes to those memory ranges from a non-DAX file will fail (note +that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory +that is being accessed that is key here). Other things that will not +work include RDMA, sendfile() and splice(). 
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt deleted file mode 100644 index b62eabf..0000000 --- a/Documentation/filesystems/xip.txt +++ /dev/null @@ -1,71 +0,0 @@ -Execute-in-place for file mappings ----------------------------------- - -Motivation ----------- -File mappings are performed by mapping page cache pages to userspace. In -addition, read&write type file operations also transfer data from/to the page -cache. - -For memory backed storage devices that use the block device interface, the page -cache pages are in fact copies of the original storage. Various approaches -exist to work around the need for an extra copy. The ramdisk driver for example -does read the data into the page cache, keeps a reference, and discards the -original data behind later on. - -Execute-in-place solves this issue the other way around: instead of keeping -data in the page cache, the need to have a page cache copy is eliminated -completely. With execute-in-place, read&write type operations are performed -directly from/to the memory backed storage device. For file mappings, the -storage device itself is mapped directly into userspace. - -This implementation was initially written for shared memory segments between -different virtual machines on s390 hardware to allow multiple machines to -share the same binaries and libraries. - -Implementation --------------- -Execute-in-place is implemented in three steps: block device operation, -address space operation, and file operations. - -A block device operation named direct_access is used to translate the -block device sector number to a page frame number (pfn) that identifies -the physical page for the memory. It also returns a kernel virtual -address that can be used to access the memory. - -The direct_access method takes a 'size' parameter that indicates the -number of bytes being requested. 
The function should return the number -of bytes that it can provide, although it must not exceed the number of -bytes requested. It may also return a negative errno if an error occurs. - -The block device operation is optional, these block devices support it as of -today: -- dcssblk: s390 dcss block device driver - -An address space operation named get_xip_mem is used to retrieve references -to a page frame number and a kernel address. To obtain these values a reference -to an address_space is provided. This function assigns values to the kmem and -pfn parameters. The third argument indicates whether the function should allocate -blocks if needed. - -This address space operation is mutually exclusive with readpage&writepage that -do page cache read/write operations. -The following filesystems support it as of today: -- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt - -A set of file operations that do utilize get_xip_page can be found in -mm/filemap_xip.c . The following file operation implementations are provided: -- aio_read/aio_write -- readv/writev -- sendfile - -The generic file operations do_sync_read/do_sync_write can be used to implement -classic synchronous IO calls. - -Shortcomings ------------- -This implementation is limited to storage devices that are cpu addressable at -all times (no highmem or such). It works well on rom/ram, but enhancements are -needed to make it work with flash in read+write mode. -Putting the Linux kernel and/or its modules on a xip filesystem does not mean -they are not copied. 
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c index ee4317fa..f9a5bf6 100644 --- a/fs/exofs/inode.c +++ b/fs/exofs/inode.c @@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = { .direct_IO = exofs_direct_IO, /* With these NULL has special meaning or default is not exported */ - .get_xip_mem = NULL, .migratepage = NULL, .launder_page = NULL, .is_partially_uptodate = NULL, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 252481f..b156fe8 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = { const struct address_space_operations ext2_aops_xip = { .bmap = ext2_bmap, - .get_xip_mem = ext2_get_xip_mem, .direct_IO = ext2_direct_IO, }; diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c index fa40091..ca745ff 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block, return ops->direct_access(bdev, sector, kaddr, pfn, size); } -static inline int -__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create, - sector_t *result) -{ - struct buffer_head tmp; - int rc; - - memset(&tmp, 0, sizeof(struct buffer_head)); - tmp.b_size = 1 << inode->i_blkbits; - rc = ext2_get_block(inode, pgoff, &tmp, create); - *result = tmp.b_blocknr; - - /* did we get a sparse block (hole in the file)? 
*/ - if (!tmp.b_blocknr && !rc) { - BUG_ON(create); - rc = -ENODATA; - } - - return rc; -} - int ext2_clear_xip_target(struct inode *inode, sector_t block) { @@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb) "not supported by bdev"); } } - -int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, - void **kmem, unsigned long *pfn) -{ - long rc; - sector_t block; - - /* first, retrieve the sector number */ - rc = __ext2_get_block(mapping->host, pgoff, create, &block); - if (rc) - return rc; - - /* retrieve address of the target data */ - rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE); - return (rc < 0) ? rc : 0; -} diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index 29be737..0fa8b7f 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb) struct ext2_sb_info *sbi = EXT2_SB(sb); return (sbi->s_mount_opt & EXT2_MOUNT_XIP); } -int ext2_get_xip_mem(struct address_space *, pgoff_t, int, - void **, unsigned long *); #else #define ext2_xip_verify_sb(sb) do { } while (0) #define ext2_use_xip(sb) 0 #define ext2_clear_xip_target(inode, chain) 0 -#define ext2_get_xip_mem NULL #endif diff --git a/fs/open.c b/fs/open.c index b9ed8b2..bc9f002 100644 --- a/fs/open.c +++ b/fs/open.c @@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f) { /* NB: we're sure to have correct a_ops only after f_op->open */ if (f->f_flags & O_DIRECT) { - if (!f->f_mapping->a_ops || - ((!f->f_mapping->a_ops->direct_IO) && - (!f->f_mapping->a_ops->get_xip_mem))) { + if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO) return -EINVAL; - } } return 0; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 9752ae5..c777056 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -375,8 +375,6 @@ struct address_space_operations { void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long 
nr_segs); - int (*get_xip_mem)(struct address_space *, pgoff_t, int, - void **, unsigned long *); /* * migrate the contents of a page to the specified target. If * migrate_mode is MIGRATE_ASYNC, it must not block. diff --git a/mm/fadvise.c b/mm/fadvise.c index 3bcfd81..1f1925f 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -28,6 +28,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) { struct fd f = fdget(fd); + struct inode *inode; struct address_space *mapping; struct backing_dev_info *bdi; loff_t endbyte; /* inclusive */ @@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) if (!f.file) return -EBADF; - if (S_ISFIFO(file_inode(f.file)->i_mode)) { + inode = file_inode(f.file); + if (S_ISFIFO(inode->i_mode)) { ret = -ESPIPE; goto out; } @@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) goto out; } - if (mapping->a_ops->get_xip_mem) { + if (IS_DAX(inode)) { switch (advice) { case POSIX_FADV_NORMAL: case POSIX_FADV_RANDOM: diff --git a/mm/madvise.c b/mm/madvise.c index 539eeb9..b6a2f52 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma, if (!file) return -EBADF; - if (file->f_mapping->a_ops->get_xip_mem) { + if (IS_DAX(file_inode(file))) { /* no bad return value, but ignore advice */ return 0; } -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb() Date: Sun, 23 Mar 2014 15:08:38 -0400 Message-ID: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com> References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Jan Kara pointed out that calling ext2_xip_verify_sb() in 
ext2_remount() doesn't make sense, since changing the XIP option on remount isn't allowed. It also doesn't make sense to re-check whether blocksize is supported since it can't change between mounts. Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the equivalent check and delete the definition. Signed-off-by: Matthew Wilcox --- fs/ext2/super.c | 33 ++++++++++++--------------------- fs/ext2/xip.c | 12 ------------ fs/ext2/xip.h | 2 -- 3 files changed, 12 insertions(+), 35 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 20d6697..3a1db39 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) ((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0); - ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset - EXT2_MOUNT_XIP if not */ - if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV && (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) || EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) || @@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); - if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) { - if (!silent) + if (sbi->s_mount_opt & EXT2_MOUNT_XIP) { + if (blocksize != PAGE_SIZE) { ext2_msg(sb, KERN_ERR, - "error: unsupported blocksize for xip"); - goto failed_mount; + "error: unsupported blocksize for xip"); + goto failed_mount; + } + if (!sb->s_bdev->bd_disk->fops->direct_access) { + ext2_msg(sb, KERN_ERR, + "error: device does not support xip"); + goto failed_mount; + } } /* If the blocksize doesn't match, re-read the thing.. 
*/ @@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) { struct ext2_sb_info * sbi = EXT2_SB(sb); struct ext2_super_block * es; - unsigned long old_mount_opt = sbi->s_mount_opt; struct ext2_mount_options old_opts; unsigned long old_sb_flags; int err; @@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | ((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0); - ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset - EXT2_MOUNT_XIP if not */ - - if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) { - ext2_msg(sb, KERN_WARNING, - "warning: unsupported blocksize for xip"); - err = -EINVAL; - goto restore_opts; - } - es = sbi->s_es; - if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) { + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { ext2_msg(sb, KERN_WARNING, "warning: refusing change of " "xip flag with busy inodes while remounting"); - sbi->s_mount_opt &= ~EXT2_MOUNT_XIP; - sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP; + sbi->s_mount_opt ^= EXT2_MOUNT_XIP; } if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) { spin_unlock(&sbi->s_lock); diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c index 132d4da..66ca113 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -13,15 +13,3 @@ #include "ext2.h" #include "xip.h" -void ext2_xip_verify_sb(struct super_block *sb) -{ - struct ext2_sb_info *sbi = EXT2_SB(sb); - - if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) && - !sb->s_bdev->bd_disk->fops->direct_access) { - sbi->s_mount_opt &= (~EXT2_MOUNT_XIP); - ext2_msg(sb, KERN_WARNING, - "warning: ignoring xip option - " - "not supported by bdev"); - } -} diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index e7b9f0a..87eeb04 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -6,13 +6,11 @@ */ #ifdef CONFIG_EXT2_FS_XIP -extern void ext2_xip_verify_sb (struct super_block *); static inline int ext2_use_xip (struct 
super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); return (sbi->s_mount_opt & EXT2_MOUNT_XIP); } #else -#define ext2_xip_verify_sb(sb) do { } while (0) #define ext2_use_xip(sb) 0 #endif -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 14/22] ext2: Remove xip.c and xip.h Date: Sun, 23 Mar 2014 15:08:40 -0400 Message-ID: <33ff0862f6d99b352429ef4494817544c3d5da68.1395591795.git.matthew.r.wilcox@intel.com> References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org These files are now empty, so delete them Signed-off-by: Matthew Wilcox --- fs/ext2/Makefile | 1 - fs/ext2/inode.c | 1 - fs/ext2/namei.c | 1 - fs/ext2/super.c | 1 - fs/ext2/xip.c | 15 --------------- fs/ext2/xip.h | 16 ---------------- 6 files changed, 35 deletions(-) delete mode 100644 fs/ext2/xip.c delete mode 100644 fs/ext2/xip.h diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile index f42af45..445b0e9 100644 --- a/fs/ext2/Makefile +++ b/fs/ext2/Makefile @@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \ ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o -ext2-$(CONFIG_EXT2_FS_XIP) += xip.o diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 2e587e2..67124f0 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -34,7 +34,6 @@ #include #include "ext2.h" #include "acl.h" -#include "xip.h" #include "xattr.h" static int __ext2_write_inode(struct inode *inode, int do_sync); diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 846c356..7ca803f 100644 ---
a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -35,7 +35,6 @@ #include "ext2.h" #include "xattr.h" #include "acl.h" -#include "xip.h" static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode) { diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 3a1db39..752ccb4 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -35,7 +35,6 @@ #include "ext2.h" #include "xattr.h" #include "acl.h" -#include "xip.h" static void ext2_sync_super(struct super_block *sb, struct ext2_super_block *es, int wait); diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c deleted file mode 100644 index 66ca113..0000000 --- a/fs/ext2/xip.c +++ /dev/null @@ -1,15 +0,0 @@ -/* - * linux/fs/ext2/xip.c - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte (cotte@de.ibm.com) - */ - -#include -#include -#include -#include -#include -#include "ext2.h" -#include "xip.h" - diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h deleted file mode 100644 index 87eeb04..0000000 --- a/fs/ext2/xip.h +++ /dev/null @@ -1,16 +0,0 @@ -/* - * linux/fs/ext2/xip.h - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte (cotte@de.ibm.com) - */ - -#ifdef CONFIG_EXT2_FS_XIP -static inline int ext2_use_xip (struct super_block *sb) -{ - struct ext2_sb_info *sbi = EXT2_SB(sb); - return (sbi->s_mount_opt & EXT2_MOUNT_XIP); -} -#else -#define ext2_use_xip(sb) 0 -#endif -- 1.9.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 19/22] ext4: Make ext4_block_zero_page_range static Date: Sun, 23 Mar 2014 15:08:45 -0400 Message-ID: <6ae0bcd05c2e114d3c4a7803415b6c2c8a8dadd7.1395591795.git.matthew.r.wilcox@intel.com> References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org It's only called within inode.c, so make it static, remove its prototype from ext4.h and move it above all of its callers so it doesn't need a prototype within inode.c. Signed-off-by: Matthew Wilcox --- fs/ext4/ext4.h | 2 -- fs/ext4/inode.c | 42 +++++++++++++++++++++--------------------- 2 files changed, 21 insertions(+), 23 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index d3a534f..e025c29 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2133,8 +2133,6 @@ extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); extern int ext4_block_truncate_page(handle_t *handle, struct address_space *mapping, loff_t from); -extern int ext4_block_zero_page_range(handle_t *handle, - struct address_space *mapping, loff_t from, loff_t length); extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode, loff_t lstart, loff_t lend); extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 6e39895..ce7341c 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3312,33 +3312,13 @@ void ext4_set_aops(struct inode *inode) } /* - * ext4_block_truncate_page() zeroes out a mapping from file offset `from' - * up to the end of the block which corresponds to `from'. - * This required during truncate. 
We need to physically zero the tail end - * of that block so it doesn't yield old data if the file is later grown. - */ -int ext4_block_truncate_page(handle_t *handle, - struct address_space *mapping, loff_t from) -{ - unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned length; - unsigned blocksize; - struct inode *inode = mapping->host; - - blocksize = inode->i_sb->s_blocksize; - length = blocksize - (offset & (blocksize - 1)); - - return ext4_block_zero_page_range(handle, mapping, from, length); -} - -/* * ext4_block_zero_page_range() zeros out a mapping of length 'length' * starting from file offset 'from'. The range to be zero'd must * be contained with in one block. If the specified range exceeds * the end of the block it will be shortened to end of the block * that cooresponds to 'from' */ -int ext4_block_zero_page_range(handle_t *handle, +static int ext4_block_zero_page_range(handle_t *handle, struct address_space *mapping, loff_t from, loff_t length) { ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT; @@ -3428,6 +3408,26 @@ unlock: return err; } +/* + * ext4_block_truncate_page() zeroes out a mapping from file offset `from' + * up to the end of the block which corresponds to `from'. + * This required during truncate. We need to physically zero the tail end + * of that block so it doesn't yield old data if the file is later grown. + */ +int ext4_block_truncate_page(handle_t *handle, + struct address_space *mapping, loff_t from) +{ + unsigned offset = from & (PAGE_CACHE_SIZE-1); + unsigned length; + unsigned blocksize; + struct inode *inode = mapping->host; + + blocksize = inode->i_sb->s_blocksize; + length = blocksize - (offset & (blocksize - 1)); + + return ext4_block_zero_page_range(handle, mapping, from, length); +} + int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode, loff_t lstart, loff_t length) { -- 1.9.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Date: Sun, 23 Mar 2014 15:08:41 -0400 Message-ID: References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org The fewer Kconfig options we have the better. Use the generic CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core. Signed-off-by: Matthew Wilcox --- fs/Kconfig | 21 ++++++++++++++------- fs/Makefile | 2 +- fs/ext2/Kconfig | 11 ----------- fs/ext2/ext2.h | 2 +- fs/ext2/file.c | 4 ++-- fs/ext2/super.c | 4 ++-- include/linux/fs.h | 4 ++-- 7 files changed, 22 insertions(+), 26 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 7385e54..620ab73 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -13,13 +13,6 @@ if BLOCK source "fs/ext2/Kconfig" source "fs/ext3/Kconfig" source "fs/ext4/Kconfig" - -config FS_XIP -# execute in place - bool - depends on EXT2_FS_XIP - default y - source "fs/jbd/Kconfig" source "fs/jbd2/Kconfig" @@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig" source "fs/btrfs/Kconfig" source "fs/nilfs2/Kconfig" +config FS_DAX + bool "Direct Access support" + depends on MMU + help + Direct Access (DAX) can be used on memory-backed block devices. + If the block device supports DAX and the filesystem supports DAX, + then you can avoid using the pagecache to buffer I/Os. Turning + on this option will compile in support for DAX; you will need to + mount the filesystem using the -o xip option. + + If you do not have a block device that is capable of using this, + or if unsure, say N. Saying Y will increase the size of the kernel + by about 2kB. 
+ endif # BLOCK # Posix ACL utility routines diff --git a/fs/Makefile b/fs/Makefile index 2f194cd..b7e0a13 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_AIO) += aio.o -obj-$(CONFIG_FS_XIP) += dax.o +obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig index 14a6780..c634874e 100644 --- a/fs/ext2/Kconfig +++ b/fs/ext2/Kconfig @@ -42,14 +42,3 @@ config EXT2_FS_SECURITY If you are not using a security module that requires using extended attributes for file security labels, say N. - -config EXT2_FS_XIP - bool "Ext2 execute in place support" - depends on EXT2_FS && MMU - help - Execute in place can be used on memory-backed block devices. If you - enable this option, you can select to mount block devices which are - capable of this feature without using the page cache. - - If you do not use a block device that is capable of using this, - or if unsure, say N. 
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 5ecf570..b30c3bd 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -380,7 +380,7 @@ struct ext2_inode { #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ -#ifdef CONFIG_FS_XIP +#ifdef CONFIG_FS_DAX #define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ #else #define EXT2_MOUNT_XIP 0 diff --git a/fs/ext2/file.c b/fs/ext2/file.c index e3ce10d..ae7f000 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -25,7 +25,7 @@ #include "xattr.h" #include "acl.h" -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { return dax_fault(vma, vmf, ext2_get_block); @@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = { .splice_write = generic_file_splice_write, }; -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX const struct file_operations ext2_xip_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 752ccb4..fdcacf7 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) seq_puts(seq, ",grpquota"); #endif -#if defined(CONFIG_EXT2_FS_XIP) +#ifdef CONFIG_FS_DAX if (sbi->s_mount_opt & EXT2_MOUNT_XIP) seq_puts(seq, ",xip"); #endif @@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb) break; #endif case Opt_xip: -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX set_opt (sbi->s_mount_opt, XIP); #else ext2_msg(sb, KERN_INFO, "xip option not supported"); diff --git a/include/linux/fs.h b/include/linux/fs.h index aeab3fda..bff394d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1681,7 +1681,7 @@ struct super_operations { #define IS_IMA(inode) ((inode)->i_flags & S_IMA) #define IS_AUTOMOUNT(inode) 
((inode)->i_flags & S_AUTOMOUNT) #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) -#ifdef CONFIG_FS_XIP +#ifdef CONFIG_FS_DAX #define IS_DAX(inode) ((inode)->i_flags & S_DAX) #else #define IS_DAX(inode) 0 @@ -2519,7 +2519,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset, extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); -#ifdef CONFIG_FS_XIP +#ifdef CONFIG_FS_DAX int dax_clear_blocks(struct inode *, sector_t block, long size); int dax_truncate_page(struct inode *, loff_t from, get_block_t); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, -- 1.9.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 21/22] ext4: Fix typos Date: Sun, 23 Mar 2014 15:08:47 -0400 Message-ID: <2b2c5467283817503fede11d12cba8aef912c9c5.1395591795.git.matthew.r.wilcox@intel.com> References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Comment fix only Signed-off-by: Matthew Wilcox --- fs/ext4/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9462730..14a9744 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3691,7 +3691,7 @@ void ext4_truncate(struct inode *inode) /* * There is a possibility that we're either freeing the inode - * or it completely new indode. In those cases we might not + * or it's a completely new inode. In those cases we might not * have i_mutex locked because it's not necessary. 
*/ if (!(inode->i_state & (I_NEW|I_FREEING))) -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 03/22] axonram: Fix bug in direct_access Date: Sun, 23 Mar 2014 15:08:29 -0400 Message-ID: References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org The 'pfn' returned by axonram was completely bogus, and has been since 2008. Signed-off-by: Matthew Wilcox --- arch/powerpc/sysdev/axonram.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c index 47b6b9f..830edc8 100644 --- a/arch/powerpc/sysdev/axonram.c +++ b/arch/powerpc/sysdev/axonram.c @@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector, } *kaddr = (void *)(bank->ph_addr + offset); - *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT; + *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; return 0; } -- 1.9.0
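The one-character axonram fix above is easy to misread, so here is a minimal userspace sketch of the difference. It is an illustration only, not the driver code: sketch_virt_to_phys() is a hypothetical identity-mapped stand-in for the kernel's virt_to_phys(), and SKETCH_PAGE_SHIFT, pfn_buggy() and pfn_fixed() are names invented for this example. The point it tries to show is that kaddr is an out-parameter (void **), so translating kaddr yields the frame holding the caller's pointer variable, while translating *kaddr yields the frame of the memory that was just stored into it.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages, for illustration only */

/*
 * Hypothetical stand-in for virt_to_phys(): an identity mapping.
 * The real kernel helper does an architecture-specific translation.
 */
static uintptr_t sketch_virt_to_phys(const void *vaddr)
{
	return (uintptr_t)vaddr;
}

/* Pre-patch form: translates the address OF the out-parameter slot. */
static uintptr_t pfn_buggy(void **kaddr)
{
	return sketch_virt_to_phys(kaddr) >> SKETCH_PAGE_SHIFT;
}

/* Post-patch form: translates the address STORED IN the out-parameter. */
static uintptr_t pfn_fixed(void **kaddr)
{
	return sketch_virt_to_phys(*kaddr) >> SKETCH_PAGE_SHIFT;
}
```

With a pointer variable slot that holds the address of some buffer, pfn_fixed(&slot) depends only on the buffer's address, while pfn_buggy(&slot) depends on where slot itself happens to live, which is typically the caller's stack.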
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 01/22] Fix XIP fault vs truncate race Date: Sun, 23 Mar 2014 15:08:27 -0400 Message-ID: <59d73a58d4cfbe190a16ce912bb2776d9cc95447.1395591795.git.matthew.r.wilcox@intel.com> References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Pagecache faults recheck i_size after taking the page lock to ensure that the fault didn't race against a truncate. We don't have a page to lock in the XIP case, so use the i_mmap_mutex instead. It is locked in the truncate path in unmap_mapping_range() after updating i_size. So while we hold it in the fault path, we are guaranteed that either i_size has already been updated in the truncate path, or that the truncate will subsequently call zap_page_range_single() and so remove the mapping we have just inserted. There is a window of time in which i_size has been reduced and the thread has a mapping to a page which will be removed from the file, but this is harmless as the page will not be allocated to a different purpose before the thread's access to it is revoked.
Signed-off-by: Matthew Wilcox --- mm/filemap_xip.c | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index d8d9fe3..c8d23e9 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -260,8 +260,17 @@ again: __xip_unmap(mapping, vmf->pgoff); found: + /* We must recheck i_size under i_mmap_mutex */ + mutex_lock(&mapping->i_mmap_mutex); + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> + PAGE_CACHE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + mutex_unlock(&mapping->i_mmap_mutex); + return VM_FAULT_SIGBUS; + } err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, xip_pfn); + mutex_unlock(&mapping->i_mmap_mutex); if (err == -ENOMEM) return VM_FAULT_OOM; /* @@ -285,16 +294,27 @@ found: } if (error != -ENODATA) goto out; + + /* We must recheck i_size under i_mmap_mutex */ + mutex_lock(&mapping->i_mmap_mutex); + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> + PAGE_CACHE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + ret = VM_FAULT_SIGBUS; + goto unlock; + } /* not shared and writable, use xip_sparse_page() */ page = xip_sparse_page(); if (!page) - goto out; + goto unlock; err = vm_insert_page(vma, (unsigned long)vmf->virtual_address, page); if (err == -ENOMEM) - goto out; + goto unlock; ret = VM_FAULT_NOPAGE; +unlock: + mutex_unlock(&mapping->i_mmap_mutex); out: write_seqcount_end(&xip_sparse_seq); mutex_unlock(&xip_sparse_mutex); -- 1.9.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
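The ordering argument in the 01/22 commit message can be sketched in userspace roughly as follows. This is an assumption-laden illustration, not the kernel code: struct sketch_file, sketch_fault() and sketch_truncate() are hypothetical names, a pthread mutex plays the role of i_mmap_mutex, and an atomic variable plays the role of i_size. The fault path rechecks the size while holding the lock; the truncate path updates the size first and only then takes the same lock to tear down mappings.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

struct sketch_file {
	pthread_mutex_t mmap_lock;	/* stands in for i_mmap_mutex */
	_Atomic uint64_t size;		/* stands in for i_size */
};

/* Round a byte size up to a page count, as the patch does for i_size. */
static uint64_t sketch_size_in_pages(uint64_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Fault path: returns true if the mapping may be inserted, false where
 * the patch would return VM_FAULT_SIGBUS.
 */
static bool sketch_fault(struct sketch_file *f, uint64_t pgoff)
{
	bool ok;

	pthread_mutex_lock(&f->mmap_lock);
	ok = pgoff < sketch_size_in_pages(atomic_load(&f->size));
	/* The real code inserts the mapping here, still under the lock. */
	pthread_mutex_unlock(&f->mmap_lock);
	return ok;
}

/* Truncate path: publish the new size first, then take the lock to unmap. */
static void sketch_truncate(struct sketch_file *f, uint64_t new_size)
{
	atomic_store(&f->size, new_size);
	pthread_mutex_lock(&f->mmap_lock);
	/* The real code removes mappings beyond new_size here. */
	pthread_mutex_unlock(&f->mmap_lock);
}
```

Because the size is written before the lock is taken in sketch_truncate(), a fault that holds the lock either sees the new size and refuses, or inserts a mapping that the truncate will remove once it acquires the lock. That is the same either/or guarantee the commit message describes.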
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 09/22] Remove mm/filemap_xip.c Date: Sun, 23 Mar 2014 15:08:35 -0400 Message-ID: <69ab315f0124881ae74d9881c48c7bdc70368fd1.1395591795.git.matthew.r.wilcox@intel.com> References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org It is now empty as all of its contents have been replaced by fs/xip.c Signed-off-by: Matthew Wilcox --- mm/Makefile | 1 - mm/filemap_xip.c | 23 ----------------------- 2 files changed, 24 deletions(-) delete mode 100644 mm/filemap_xip.c diff --git a/mm/Makefile b/mm/Makefile index 310c90a..454c176 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o obj-$(CONFIG_KMEMCHECK) += kmemcheck.o obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o -obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c deleted file mode 100644 index 6316578..0000000 --- a/mm/filemap_xip.c +++ /dev/null @@ -1,23 +0,0 @@ -/* - * linux/mm/filemap_xip.c - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte - * - * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds - * - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -- 1.9.0
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Date: Sun, 23 Mar 2014 15:08:32 -0400 Message-ID: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> References: Cc: Matthew Wilcox , willy@linux.intel.com To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Use the generic AIO infrastructure instead of custom read and write methods. In addition to giving us support for AIO, this adds the missing locking between read() and truncate(). Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler --- fs/Makefile | 1 + fs/dax.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext2/file.c | 6 +- fs/ext2/inode.c | 7 +- include/linux/fs.h | 18 ++++- mm/filemap.c | 6 +- mm/filemap_xip.c | 234 ----------------------------------------------------- 7 files changed, 243 insertions(+), 245 deletions(-) create mode 100644 fs/dax.c diff --git a/fs/Makefile b/fs/Makefile index 47ac07b..2f194cd 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_FS_XIP) += dax.o obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o diff --git a/fs/dax.c b/fs/dax.c new file mode 100644 index 0000000..66a6bda --- /dev/null +++ b/fs/dax.c @@ -0,0 +1,216 @@ +/* + * fs/dax.c - Direct Access filesystem code + * Copyright (c) 2013-2014 Intel Corporation + * Author: Matthew Wilcox + * Author: Ross Zwisler + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the 
Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ + +#include +#include +#include +#include +#include +#include +#include + +static long dax_get_addr(struct inode *inode, struct buffer_head *bh, + void **addr) +{ + struct block_device *bdev = bh->b_bdev; + const struct block_device_operations *ops = bdev->bd_disk->fops; + unsigned long pfn; + sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); + return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size); +} + +static void dax_new_buf(void *addr, unsigned size, unsigned first, + loff_t offset, loff_t end, int rw) +{ + loff_t final = end - offset + first; /* The final byte of the buffer */ + if (rw != WRITE) { + memset(addr, 0, size); + return; + } + + if (first > 0) + memset(addr, 0, first); + if (final < size) + memset(addr + final, 0, size - final); +} + +static bool buffer_written(struct buffer_head *bh) +{ + return buffer_mapped(bh) && !buffer_unwritten(bh); +} + +/* + * When ext4 encounters a hole, it likes to return without modifying the + * buffer_head which means that we can't trust b_size. To cope with this, + * we set b_state to 0 before calling get_block and, if any bit is set, we + * know we can trust b_size. Unfortunate, really, since ext4 does know + * precisely how long a hole is and would save us time calling get_block + * repeatedly. 
+ */ +static bool buffer_size_valid(struct buffer_head *bh) +{ + return bh->b_state != 0; +} + +static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov, + loff_t start, loff_t end, get_block_t get_block, + struct buffer_head *bh) +{ + ssize_t retval = 0; + unsigned seg = 0; + unsigned len; + unsigned copied = 0; + loff_t offset = start; + loff_t max = start; + loff_t bh_max = start; + void *addr; + bool hole = false; + + if (rw != WRITE) + end = min(end, i_size_read(inode)); + + while (offset < end) { + void __user *buf = iov[seg].iov_base + copied; + + if (offset == max) { + sector_t block = offset >> inode->i_blkbits; + unsigned first = offset - (block << inode->i_blkbits); + long size; + + if (offset == bh_max) { + bh->b_size = PAGE_ALIGN(end - offset); + bh->b_state = 0; + retval = get_block(inode, block, bh, + rw == WRITE); + if (retval) + break; + if (!buffer_size_valid(bh)) + bh->b_size = 1 << inode->i_blkbits; + bh_max = offset - first + bh->b_size; + } else { + unsigned done = bh->b_size - (bh_max - + (offset - first)); + bh->b_blocknr += done >> inode->i_blkbits; + bh->b_size -= done; + } + if (rw == WRITE) { + if (!buffer_mapped(bh)) { + retval = -EIO; + break; + } + hole = false; + } else { + hole = !buffer_written(bh); + } + + if (hole) { + addr = NULL; + size = bh->b_size - first; + } else { + retval = dax_get_addr(inode, bh, &addr); + if (retval < 0) + break; + if (buffer_unwritten(bh) || buffer_new(bh)) + dax_new_buf(addr, retval, first, + offset, end, rw); + addr += first; + size = retval - first; + } + max = min(offset + size, end); + } + + len = min_t(unsigned, iov[seg].iov_len - copied, max - offset); + + if (rw == WRITE) + len -= __copy_from_user_nocache(addr, buf, len); + else if (!hole) + len -= __copy_to_user(buf, addr, len); + else + len -= __clear_user(buf, len); + + if (!len) + break; + + offset += len; + copied += len; + addr += len; + if (copied == iov[seg].iov_len) { + seg++; + copied = 0; + } + } + + return (offset 
== start) ? retval : offset - start; +} + +/** + * dax_do_io - Perform I/O to a DAX file + * @rw: READ to read or WRITE to write + * @iocb: The control block for this I/O + * @inode: The file which the I/O is directed at + * @iov: The user addresses to do I/O from or to + * @offset: The file offset where the I/O starts + * @nr_segs: The length of the iov array + * @get_block: The filesystem method used to translate file offsets to blocks + * @end_io: A filesystem callback for I/O completion + * @flags: See below + * + * This function uses the same locking scheme as do_blockdev_direct_IO: + * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the + * caller for writes. For reads, we take and release the i_mutex ourselves. + * If DIO_LOCKING is not set, the filesystem takes care of its own locking. + * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O + * is in progress. + */ +ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, + const struct iovec *iov, loff_t offset, unsigned nr_segs, + get_block_t get_block, dio_iodone_t end_io, int flags) +{ + struct buffer_head bh; + unsigned seg; + ssize_t retval = -EINVAL; + loff_t end = offset; + + memset(&bh, 0, sizeof(bh)); + for (seg = 0; seg < nr_segs; seg++) + end += iov[seg].iov_len; + + if ((flags & DIO_LOCKING) && (rw == READ)) { + struct address_space *mapping = inode->i_mapping; + mutex_lock(&inode->i_mutex); + retval = filemap_write_and_wait_range(mapping, offset, end - 1); + if (retval) { + mutex_unlock(&inode->i_mutex); + goto out; + } + } + + /* Protects against truncate */ + atomic_inc(&inode->i_dio_count); + + retval = dax_io(rw, inode, iov, offset, end, get_block, &bh); + + if ((flags & DIO_LOCKING) && (rw == READ)) + mutex_unlock(&inode->i_mutex); + + inode_dio_done(inode); + + if ((retval > 0) && end_io) + end_io(iocb, offset, retval, bh.b_private); + out: + return retval; +} +EXPORT_SYMBOL_GPL(dax_do_io); diff --git a/fs/ext2/file.c b/fs/ext2/file.c 
index 44c36e5..ef5cf96 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = { #ifdef CONFIG_EXT2_FS_XIP const struct file_operations ext2_xip_file_operations = { .llseek = generic_file_llseek, - .read = xip_file_read, - .write = xip_file_write, + .read = do_sync_read, + .write = do_sync_write, + .aio_read = generic_file_aio_read, + .aio_write = generic_file_aio_write, .unlocked_ioctl = ext2_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext2_compat_ioctl, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index e7d3192..f128ebf 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, struct inode *inode = mapping->host; ssize_t ret; - ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, + ext2_get_block, NULL, DIO_LOCKING); + else + ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, ext2_get_block); if (ret < 0 && (rw & WRITE)) ext2_write_failed(mapping, offset + iov_length(iov, nr_segs)); @@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = { const struct address_space_operations ext2_aops_xip = { .bmap = ext2_bmap, .get_xip_mem = ext2_get_xip_mem, + .direct_IO = ext2_direct_IO, }; const struct address_space_operations ext2_nobh_aops = { diff --git a/include/linux/fs.h b/include/linux/fs.h index 47fd219..dabc601 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2521,17 +2521,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_XIP -extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len, - loff_t *ppos); extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma); -extern ssize_t xip_file_write(struct file *filp, const char __user *buf, - size_t len, 
loff_t *ppos);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
+		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
 	return 0;
 }
+
+static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+		const struct iovec *iov, loff_t offset, unsigned nr_segs,
+		get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	return -ENOTTY;
+}
 #endif
 
 #ifdef CONFIG_BLOCK
@@ -2681,6 +2686,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
+static inline bool io_is_direct(struct file *filp)
+{
+	return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6a..1b7dff6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	if (retval)
 		return retval;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (filp->f_flags & O_DIRECT) {
+	if (io_is_direct(filp)) {
 		loff_t size;
 		struct address_space *mapping;
 		struct inode *inode;
@@ -2468,8 +2467,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (err)
 		goto out;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (unlikely(file->f_flags & O_DIRECT)) {
+	if (io_is_direct(file)) {
 		loff_t endbyte;
 		ssize_t written_buffered;
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
 }
 
 /*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all. It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
-		    struct file_ra_state *_ra,
-		    struct file *filp,
-		    char __user *buf,
-		    size_t len,
-		    loff_t *ppos)
-{
-	struct inode *inode = mapping->host;
-	pgoff_t index, end_index;
-	unsigned long offset;
-	loff_t isize, pos;
-	size_t copied = 0, error = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	pos = *ppos;
-	index = pos >> PAGE_CACHE_SHIFT;
-	offset = pos & ~PAGE_CACHE_MASK;
-
-	isize = i_size_read(inode);
-	if (!isize)
-		goto out;
-
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
-	do {
-		unsigned long nr, left;
-		void *xip_mem;
-		unsigned long xip_pfn;
-		int zero = 0;
-
-		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
-		if (index >= end_index) {
-			if (index > end_index)
-				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
-			if (nr <= offset) {
-				goto out;
-			}
-		}
-		nr = nr - offset;
-		if (nr > len - copied)
-			nr = len - copied;
-
-		error = mapping->a_ops->get_xip_mem(mapping, index, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(error)) {
-			if (error == -ENODATA) {
-				/* sparse */
-				zero = 1;
-			} else
-				goto out;
-		}
-
-		/* If users can be writing to this page using arbitrary
-		 * virtual addresses, take care about potential aliasing
-		 * before reading the page on the kernel side.
-		 */
-		if (mapping_writably_mapped(mapping))
-			/* address based flush */ ;
-
-		/*
-		 * Ok, we have the mem, so now we can copy it to user space...
-		 *
-		 * The actor routine returns how many bytes were actually used..
-		 * NOTE! This may not be the same as how much of a user buffer
-		 * we filled up (we may be padding etc), so we can only update
-		 * "pos" here (the actor routine has to update the user buffer
-		 * pointers and the remaining count).
-		 */
-		if (!zero)
-			left = __copy_to_user(buf+copied, xip_mem+offset, nr);
-		else
-			left = __clear_user(buf + copied, nr);
-
-		if (left) {
-			error = -EFAULT;
-			goto out;
-		}
-
-		copied += (nr - left);
-		offset += (nr - left);
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
-	} while (copied < len);
-
-out:
-	*ppos = pos + copied;
-	if (filp)
-		file_accessed(filp);
-
-	return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
-	if (!access_ok(VERIFY_WRITE, buf, len))
-		return -EFAULT;
-
-	return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
-			    buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
  * __xip_unmap is invoked from xip_unmap and
  * xip_write
  *
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
 }
 EXPORT_SYMBOL_GPL(xip_file_mmap);
 
-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
-		  size_t count, loff_t pos, loff_t *ppos)
-{
-	struct address_space * mapping = filp->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode *inode = mapping->host;
-	long status = 0;
-	size_t bytes;
-	ssize_t written = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	do {
-		unsigned long index;
-		unsigned long offset;
-		size_t copied;
-		void *xip_mem;
-		unsigned long xip_pfn;
-
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
-		if (bytes > count)
-			bytes = count;
-
-		status = a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-		if (status == -ENODATA) {
-			/* we allocate a new page unmap it */
-			mutex_lock(&xip_sparse_mutex);
-			status = a_ops->get_xip_mem(mapping, index, 1,
-							&xip_mem, &xip_pfn);
-			mutex_unlock(&xip_sparse_mutex);
-			if (!status)
-				/* unmap page at pgoff from all other vmas */
-				__xip_unmap(mapping, index);
-		}
-
-		if (status)
-			break;
-
-		copied = bytes -
-			__copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
-		if (likely(copied > 0)) {
-			status = copied;
-
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-			}
-		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
-		if (status < 0)
-			break;
-	} while (count);
-	*ppos = pos;
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold i_mutex.
-	 */
-	if (pos > inode->i_size) {
-		i_size_write(inode, pos);
-		mark_inode_dirty(inode);
-	}
-
-	return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
-	       loff_t *ppos)
-{
-	struct address_space *mapping = filp->f_mapping;
-	struct inode *inode = mapping->host;
-	size_t count;
-	loff_t pos;
-	ssize_t ret;
-
-	mutex_lock(&inode->i_mutex);
-
-	if (!access_ok(VERIFY_READ, buf, len)) {
-		ret=-EFAULT;
-		goto out_up;
-	}
-
-	pos = *ppos;
-	count = len;
-
-	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
-
-	ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
-	if (ret)
-		goto out_backing;
-	if (count == 0)
-		goto out_backing;
-
-	ret = file_remove_suid(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = file_update_time(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
-	current->backing_dev_info = NULL;
- out_up:
-	mutex_unlock(&inode->i_mutex);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
 /*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
-- 
1.9.0

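As a reading aid for the io_is_direct() helper added in the patch above, the sketch below models the same decision in plain userspace C. It is not kernel code: the SKETCH_* flag values and the sketch_* structs are hypothetical stand-ins for O_DIRECT, S_DAX, struct file and struct inode. Only the shape of the test (explicit O_DIRECT, or an inode marked DAX) is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's O_DIRECT and S_DAX bits. */
#define SKETCH_O_DIRECT 0x1u
#define SKETCH_S_DAX    0x2u

/* Hypothetical, heavily reduced models of struct inode and struct file. */
struct sketch_inode {
	unsigned int i_flags;
};

struct sketch_file {
	unsigned int f_flags;
	struct sketch_inode *inode;
};

/*
 * Same test as io_is_direct() in the patch: I/O bypasses the page cache
 * when the caller asked for O_DIRECT or when the inode is a DAX inode.
 */
static bool sketch_io_is_direct(const struct sketch_file *filp)
{
	return (filp->f_flags & SKETCH_O_DIRECT) ||
	       (filp->inode->i_flags & SKETCH_S_DAX);
}
```

In this model a file opened without O_DIRECT still takes the direct path when its inode carries the DAX flag, which appears to be the reason the patch routes generic_file_aio_read() and __generic_file_aio_write() through the helper.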
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 13/22] ext2: Remove ext2_use_xip
Date: Sun, 23 Mar 2014 15:08:39 -0400
Message-ID: <0c65dcd599646e3054d0c524a0c5b25b07885763.1395591795.git.matthew.r.wilcox@intel.com>
References:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
In-Reply-To:
References:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Replace ext2_use_xip() with test_opt(XIP) which expands to the same code

Signed-off-by: Matthew Wilcox
---
 fs/ext2/ext2.h  | 4 ++++
 fs/ext2/inode.c | 2 +-
 fs/ext2/namei.c | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP			0
+#endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index a9346a9..2e587e2 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (ext2_use_xip(inode->i_sb)) {
+		if (test_opt(inode->i_sb, XIP)) {
 			inode->i_mapping->a_ops = &ext2_aops_xip;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
-- 
1.9.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page
Date: Sun, 23 Mar 2014 15:08:34 -0400
Message-ID:
References:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
In-Reply-To:
References:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

It takes a get_block parameter just like nobh_truncate_page() and
block_truncate_page()

Signed-off-by: Matthew Wilcox
---
 fs/dax.c           | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
 fs/ext2/inode.c    |  2 +-
 include/linux/fs.h |  4 ++--
 mm/filemap_xip.c   | 40 ----------------------------------------
 4 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 863749c..7271be0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -374,13 +374,13 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 
 /**
- * dax_fault - handle a page fault on an XIP file
+ * dax_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
  * When a page fault occurs, filesystems may call this helper in their
- * fault handler for XIP files.
+ * fault handler for DAX files.
  */
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			get_block_t get_block)
@@ -398,12 +398,12 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
- * dax_mkwrite - convert a read-only page to read-write in an XIP file
+ * dax_mkwrite - convert a read-only page to read-write in a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * XIP handles reads of holes by adding pages full of zeroes into the
+ * DAX handles reads of holes by adding pages full of zeroes into the
  * mapping. If the page is subsequenty written to, we have to allocate
  * the page on media and free the page that was in the cache.
  */
@@ -421,3 +421,47 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return result;
 }
 EXPORT_SYMBOL_GPL(dax_mkwrite);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks. Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+	struct buffer_head bh;
+	pgoff_t index = from >> PAGE_CACHE_SHIFT;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length = PAGE_CACHE_ALIGN(from) - from;
+	int err;
+
+	/* Block boundary? Nothing to do */
+	if (!length)
+		return 0;
+
+	memset(&bh, 0, sizeof(bh));
+	bh.b_size = PAGE_CACHE_SIZE;
+	err = get_block(inode, index, &bh, 0);
+	if (err < 0)
+		return err;
+	if (buffer_written(&bh)) {
+		void *addr;
+		err = dax_get_addr(inode, &bh, &addr);
+		if (err)
+			return err;
+		memset(addr + offset, 0, length);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f128ebf..252481f 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1207,7 +1207,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_DAX(inode))
-		error = xip_truncate_page(inode->i_mapping, newsize);
+		error = dax_truncate_page(inode, newsize, ext2_get_block);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1607812..9752ae5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2522,13 +2522,13 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
 {
 	return 0;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
 #include
 #include
 
-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned blocksize;
-	unsigned length;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	int err;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	blocksize = 1 << mapping->host->i_blkbits;
-	length = offset & (blocksize - 1);
-
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
-
-	length = blocksize - length;
-
-	err = mapping->a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-	if (unlikely(err)) {
-		if (err == -ENODATA)
-			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
-	}
-	memset(xip_mem + offset, 0, length);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
-- 
1.9.0

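The index/offset/length arithmetic at the top of dax_truncate_page() in the patch above can be checked in isolation. What follows is a minimal userspace sketch, assuming a 4096-byte page (the kernel uses PAGE_CACHE_SIZE, which depends on the architecture). struct sketch_span and sketch_truncate_span() are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for the sketch; the kernel uses PAGE_CACHE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Hypothetical result type: which page, where in it, and how much to zero. */
struct sketch_span {
	uint64_t index;   /* page number containing the new end of file */
	unsigned offset;  /* byte offset of the new end within that page */
	unsigned length;  /* bytes from the new end up to the page end */
};

/* Mirrors the three initialisers at the top of dax_truncate_page(). */
static struct sketch_span sketch_truncate_span(uint64_t from)
{
	struct sketch_span s;
	uint64_t mask = (uint64_t)SKETCH_PAGE_SIZE - 1;
	uint64_t aligned = (from + mask) & ~mask;  /* PAGE_CACHE_ALIGN(from) */

	s.index = from >> SKETCH_PAGE_SHIFT;
	s.offset = (unsigned)(from & mask);
	s.length = (unsigned)(aligned - from);
	return s;
}
```

For from = 5000 this gives index 1, offset 904 and length 3192. For a page-aligned offset such as 8192 the length is 0, which is the "Nothing to do" early return in the patch.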
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 17/22] Get rid of most mentions of XIP in ext2
Date: Sun, 23 Mar 2014 15:08:43 -0400
Message-ID: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com>
References:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
In-Reply-To:
References:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

The only remaining usage is userspace's 'xip' option.
---
 fs/ext2/ext2.h  |  6 +++---
 fs/ext2/file.c  |  2 +-
 fs/ext2/inode.c |  6 +++---
 fs/ext2/namei.c |  8 ++++----
 fs/ext2/super.c | 16 ++++++++--------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..0e1fe9d 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -381,9 +381,9 @@ struct ext2_inode {
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
 #ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#define EXT2_MOUNT_DAX			0x010000  /* Direct Access */
 #else
-#define EXT2_MOUNT_XIP			0
+#define EXT2_MOUNT_DAX			0
 #endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
@@ -789,7 +789,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync);
 extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ae7f000..f9bcb9b 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
 };
 
 #ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7ca76da..3776063 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1285,7 +1285,7 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
-	if (test_opt(inode->i_sb, XIP))
+	if (test_opt(inode->i_sb, DAX))
 		inode->i_flags |= S_DAX;
 }
 
@@ -1387,9 +1387,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, XIP)) {
+		if (test_opt(inode->i_sb, DAX)) {
 			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_xip_file_operations;
+			inode->i_fop = &ext2_dax_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 			inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, XIP)) {
+	if (test_opt(inode->i_sb, DAX)) {
 		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_xip_file_operations;
+		inode->i_fop = &ext2_dax_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
 		inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index fdcacf7..8062373 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -288,7 +288,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 #endif
 
 #ifdef CONFIG_FS_DAX
-	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
 		seq_puts(seq, ",xip");
 #endif
 
@@ -393,7 +393,7 @@ enum {
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
 	Opt_nouid32, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov,
 	Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
-	Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+	Opt_acl, Opt_noacl, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
 	Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
 };
 
@@ -421,7 +421,7 @@ static const match_table_t tokens = {
 	{Opt_nouser_xattr, "nouser_xattr"},
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
-	{Opt_xip, "xip"},
+	{Opt_dax, "xip"},
 	{Opt_grpquota, "grpquota"},
 	{Opt_ignore, "noquota"},
 	{Opt_quota, "quota"},
@@ -548,9 +548,9 @@ static int parse_options(char *options, struct super_block *sb)
 				"(no)acl options not supported");
 			break;
 #endif
-		case Opt_xip:
+		case Opt_dax:
 #ifdef CONFIG_FS_DAX
-			set_opt (sbi->s_mount_opt, XIP);
+			set_opt (sbi->s_mount_opt, DAX);
 #else
 			ext2_msg(sb, KERN_INFO, "xip option not supported");
 #endif
@@ -896,7 +896,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+	if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
 		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
 					"error: unsupported blocksize for xip");
@@ -1275,10 +1275,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
-- 
1.9.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 18/22] xip: Add xip_zero_page_range
Date: Sun, 23 Mar 2014 15:08:44 -0400
Message-ID: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com>
References:
Cc: Matthew Wilcox, willy@linux.intel.com, Ross Zwisler
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
In-Reply-To:
References:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

This new function allows us to support hole-punch for XIP files by zeroing
a partial page, as opposed to the xip_truncate_page() function which can
only truncate to the end of the page. Reimplement xip_truncate_page() as
a macro that calls xip_zero_page_range().

Signed-off-by: Matthew Wilcox
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler
---
 Documentation/filesystems/dax.txt |  1 +
 fs/dax.c                          | 22 +++++++++++++++-------
 include/linux/fs.h                |  9 ++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 06f84e5..e5706cc 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -55,6 +55,7 @@ Filesystem support consists of
   for fault and page_mkwrite (which should probably call dax_fault() and
   dax_mkwrite(), passing the appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
   truncates and page faults
 
diff --git a/fs/dax.c b/fs/dax.c
index 45a0a41..2d6b4bc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -457,13 +457,16 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_mkwrite);
 
 /**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file. This is intended for hole-punch operations. If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -471,12 +474,12 @@ EXPORT_SYMBOL_GPL(dax_mkwrite);
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
  * since the file might be mmaped.
  */
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+							get_block_t get_block)
 {
 	struct buffer_head bh;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length = PAGE_CACHE_ALIGN(from) - from;
 	int err;
 
 	/* Block boundary? Nothing to do */
@@ -491,11 +494,16 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	if (buffer_written(&bh)) {
 		void *addr;
 		err = dax_get_addr(inode, &bh, &addr);
-		if (err)
+		if (err < 0)
 			return err;
+		/*
+		 * ext4 sometimes asks to zero past the end of a block. It
+		 * really just wants to zero to the end of the block.
+		 */
+		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
 		memset(addr + offset, 0, length);
 	}
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(dax_truncate_page);
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bff394d..d0381ab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2521,6 +2521,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
@@ -2532,7 +2533,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }
 
-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int dax_zero_page_range(struct inode *inode, loff_t from,
+						unsigned len, get_block_t gb)
 {
 	return 0;
 }
@@ -2545,6 +2547,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 }
 #endif
 
+/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
+#define dax_truncate_page(inode, from, get_block)	\
+	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
+
+
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
 							loff_t file_offset);
-- 
1.9.0

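The clamp added in dax_zero_page_range(), and the way the dax_truncate_page() macro leans on it, can be tried out on an ordinary buffer. The sketch below is userspace C under stated assumptions: a 4096-byte page, a plain byte array standing in for the address that dax_get_addr() would return, and hypothetical sketch_* names. It leaves out get_block() and the buffer_written() check entirely.

```c
#include <assert.h>
#include <string.h>

/* Assumed page size for the sketch; the kernel uses PAGE_CACHE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Zero up to `length` bytes starting at file offset `from`, but never
 * past the end of the page that contains `from`. Returns the number of
 * bytes actually zeroed. `page` is assumed to point at the start of
 * that page and to be SKETCH_PAGE_SIZE bytes long.
 */
static unsigned sketch_zero_page_range(unsigned char *page,
				       unsigned long from, unsigned length)
{
	unsigned offset = (unsigned)(from & (SKETCH_PAGE_SIZE - 1));
	unsigned room = SKETCH_PAGE_SIZE - offset;

	if (!length)
		return 0;
	if (length > room)	/* same effect as the min_t() in the patch */
		length = room;
	memset(page + offset, 0, length);
	return length;
}

/*
 * Truncate form: request a whole page and rely on the clamp to stop at
 * the page end, as the dax_truncate_page() macro does.
 */
static unsigned sketch_truncate_page(unsigned char *page, unsigned long from)
{
	return sketch_zero_page_range(page, from, SKETCH_PAGE_SIZE);
}
```

A hole-punch style caller passes the exact byte count it wants cleared, while a truncate style caller can always pass a full page and let the clamp trim the request.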
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 22/22] brd: Rename XIP to DAX
Date: Sun, 23 Mar 2014 15:08:48 -0400
Message-ID: <7fd74703525f4077ed7c2b273ce6d082b03f0b61.1395591795.git.matthew.r.wilcox@intel.com>
References:
Cc: Matthew Wilcox, Matthew Wilcox
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
In-Reply-To:
References:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

From: Matthew Wilcox

Since this is relating to FS_XIP, not KERNEL_XIP, it should be called
DAX instead of XIP.

Signed-off-by: Matthew Wilcox
---
 drivers/block/Kconfig | 13 +++++++------
 drivers/block/brd.c   | 14 +++++++-------
 fs/Kconfig            |  4 ++--
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..1b8094d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
-config BLK_DEV_XIP
-	bool "Support XIP filesystems on RAM block device"
-	depends on BLK_DEV_RAM
+config BLK_DEV_RAM_DAX
+	bool "Support Direct Access (DAX) to RAM block devices"
+	depends on BLK_DEV_RAM && FS_DAX
 	default n
 	help
-	  Support XIP filesystems (such as ext2 with XIP support on) on
-	  top of block ram device. This will slightly enlarge the kernel, and
-	  will prevent RAM block device backing store memory from being
+	  Support filesystems using DAX to access RAM block devices. This
+	  avoids double-buffering data in the page cache before copying it
+	  to the block device. Answering Y will slightly enlarge the kernel,
+	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
 config CDROM_PKTCDVD
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 00da60d..619e0e0 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
 	 *
-	 * Cannot support XIP and highmem, because our ->direct_access
-	 * routine for XIP must return memory that is always addressable.
-	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * Cannot support DAX and highmem, because our ->direct_access
+	 * routine for DAX must return memory that is always addressable.
+	 * If DAX was reworked to use pfns and kmap throughout, this
 	 * restriction might be able to be lifted.
 	 */
 	gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_XIP
+#ifndef CONFIG_BLK_DEV_RAM_DAX
 	gfp_flags |= __GFP_HIGHMEM;
 #endif
 	page = alloc_page(gfp_flags);
@@ -360,7 +360,7 @@ out:
 	bio_endio(bio, err);
 }
 
-#ifdef CONFIG_BLK_DEV_XIP
+#ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
 			void **kaddr, unsigned long *pfn, long size)
 {
@@ -383,6 +383,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	 * file is mapped to the next page of physical RAM */
 	return PAGE_SIZE;
 }
+#else
+#define brd_direct_access NULL
 #endif
 
 static int brd_ioctl(struct block_device *bdev, fmode_t mode,
@@ -422,9 +424,7 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode,
 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.ioctl =		brd_ioctl,
-#ifdef CONFIG_BLK_DEV_XIP
 	.direct_access =	brd_direct_access,
-#endif
 };
 
 /*
diff --git a/fs/Kconfig b/fs/Kconfig
index 620ab73..376bd0a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
 config FS_DAX
-	bool "Direct Access support"
+	bool "Direct Access (DAX) support"
 	depends on MMU
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
@@ -45,7 +45,7 @@ config FS_DAX
 
 	  If you do not have a block device that is capable of using this,
 	  or if unsure, say N. Saying Y will increase the size of the kernel
-	  by about 2kB.
+	  by about 5kB.
 
 endif # BLOCK
 
-- 
1.9.0

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: [PATCH v7 16/22] ext2: Remove ext2_aops_xip
Date: Sun, 23 Mar 2014 15:08:42 -0400
Message-ID: <0b6512aa46a504459f41d3c609fc20c93d4a911a.1395591795.git.matthew.r.wilcox@intel.com>
References:
Cc: Matthew Wilcox, willy@linux.intel.com
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Return-path:
In-Reply-To:
In-Reply-To:
References:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

We shouldn't need a special address_space_operations any more

Signed-off-by: Matthew Wilcox
---
 fs/ext2/ext2.h  | 1 -
 fs/ext2/inode.c | 7 +------
 fs/ext2/namei.c | 4 ++--
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b30c3bd..b8b1c11 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
 
 /* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 67124f0..7ca76da 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
-const struct address_space_operations ext2_aops_xip = {
-	.bmap			= ext2_bmap,
-	.direct_IO		= ext2_direct_IO,
-};
-
 const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
 		if (test_opt(inode->i_sb, XIP)) {
-			inode->i_mapping->a_ops = &ext2_aops_xip;
+			inode->i_mapping->a_ops = &ext2_aops;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 7ca803f..0db888c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
-- 
1.9.0

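The brd patch above (22/22) drops the #ifdef around .direct_access in brd_fops by defining brd_direct_access to NULL when the option is off. A minimal sketch of that idiom in userspace C follows. The names SKETCH_CONFIG_DAX, struct sketch_ops and sketch_direct_access are hypothetical and exist only for this illustration; it is not the brd driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table with one optional method. */
struct sketch_ops {
	long (*direct_access)(unsigned long sector, void **kaddr);
};

#ifdef SKETCH_CONFIG_DAX
static char sketch_backing[4096];

/* Hand back a directly addressable buffer and its size in bytes. */
static long sketch_direct_access(unsigned long sector, void **kaddr)
{
	(void)sector;	/* the sketch has a single backing page */
	*kaddr = sketch_backing;
	return (long)sizeof(sketch_backing);
}
#else
/* Option off: the name still exists, it just expands to a null pointer. */
#define sketch_direct_access NULL
#endif

/* No #ifdef is needed in the initializer itself. */
static const struct sketch_ops sketch_fops = {
	.direct_access = sketch_direct_access,
};

/* Callers test the pointer, as they would for any optional method. */
static int sketch_has_direct_access(const struct sketch_ops *ops)
{
	return ops->direct_access != NULL;
}
```

Built without -DSKETCH_CONFIG_DAX, the table entry is a null pointer and callers that check for the method see it as absent. The conditional compilation stays next to the function definition instead of being repeated at every use.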
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: [PATCH v7 20/22] ext4: Add DAX functionality Date: Sun, 23 Mar 2014 15:08:46 -0400 Message-ID: <490bf3041f0e0633964ca84bf4fb0bb3dd999694.1395591795.git.matthew.r.wilcox@intel.com> References: Cc: Ross Zwisler , willy@linux.intel.com, Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: Ross Zwisler This is a port of the DAX functionality found in the current version of ext2. Signed-off-by: Ross Zwisler Reviewed-by: Andreas Dilger [heavily tweaked] Signed-off-by: Matthew Wilcox --- Documentation/filesystems/dax.txt | 1 + Documentation/filesystems/ext4.txt | 2 ++ fs/ext4/ext4.h | 6 +++++ fs/ext4/file.c | 53 +++++++++++++++++++++++++++++++++----- fs/ext4/indirect.c | 19 +++++++++----- fs/ext4/inode.c | 52 ++++++++++++++++++++++++------------- fs/ext4/namei.c | 10 +++++-- fs/ext4/super.c | 39 +++++++++++++++++++++++++++- 8 files changed, 149 insertions(+), 33 deletions(-) diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index e5706cc..619dab5 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -66,6 +66,7 @@ or a write()) work correctly. These filesystems may be used for inspiration: - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt +- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt Shortcomings diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 919a329..9c511c4 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any i_version Enable 64-bit inode version support. This option is off by default. 
+dax Use direct access if possible + Data Mode ========= There are 3 different data modes: diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index e025c29..00e9b79 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -966,6 +966,11 @@ struct ext4_inode_info { #define EXT4_MOUNT_ERRORS_MASK 0x00070 #define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */ #define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/ +#ifdef CONFIG_FS_DAX +#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */ +#else +#define EXT4_MOUNT_DAX 0 +#endif #define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */ #define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */ #define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */ @@ -2581,6 +2586,7 @@ extern const struct file_operations ext4_dir_operations; /* file.c */ extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; +extern const struct file_operations ext4_dax_file_operations; extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); extern void ext4_unwritten_wait(struct inode *inode); diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 1a50739..42a8ccd 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, } } - if (unlikely(iocb->ki_filp->f_flags & O_DIRECT)) + if (io_is_direct(iocb->ki_filp)) ret = ext4_file_dio_write(iocb, iov, nr_segs, pos); else ret = generic_file_aio_write(iocb, iov, nr_segs, pos); @@ -198,6 +198,27 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, return ret; } +#ifdef CONFIG_FS_DAX +static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_fault(vma, vmf, ext4_get_block); + /* Is this the right get_block? 
*/ +} + +static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_mkwrite(vma, vmf, ext4_get_block); +} + +static const struct vm_operations_struct ext4_dax_vm_ops = { + .fault = ext4_dax_fault, + .page_mkwrite = ext4_dax_mkwrite, + .remap_pages = generic_file_remap_pages, +}; +#else +#define ext4_dax_vm_ops ext4_file_vm_ops +#endif + static const struct vm_operations_struct ext4_file_vm_ops = { .fault = filemap_fault, .page_mkwrite = ext4_page_mkwrite, @@ -206,12 +227,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = { static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) { - struct address_space *mapping = file->f_mapping; - - if (!mapping->a_ops->readpage) - return -ENOEXEC; file_accessed(file); - vma->vm_ops = &ext4_file_vm_ops; + if (IS_DAX(file_inode(file))) { + vma->vm_ops = &ext4_dax_vm_ops; + vma->vm_flags |= VM_MIXEDMAP; + } else { + vma->vm_ops = &ext4_file_vm_ops; + } return 0; } @@ -609,6 +631,25 @@ const struct file_operations ext4_file_operations = { .fallocate = ext4_fallocate, }; +#ifdef CONFIG_FS_DAX +const struct file_operations ext4_dax_file_operations = { + .llseek = ext4_llseek, + .read = do_sync_read, + .write = do_sync_write, + .aio_read = generic_file_aio_read, + .aio_write = ext4_file_write, + .unlocked_ioctl = ext4_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ext4_compat_ioctl, +#endif + .mmap = ext4_file_mmap, + .open = ext4_file_open, + .release = ext4_release_file, + .fsync = ext4_sync_file, + .fallocate = ext4_fallocate, +}; +#endif + const struct inode_operations ext4_file_inode_operations = { .setattr = ext4_setattr, .getattr = ext4_getattr, diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 594009f..5fdb414 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -686,15 +686,22 @@ retry: inode_dio_done(inode); goto locked; } - ret = __blockdev_direct_IO(rw, iocb, inode, - inode->i_sb->s_bdev, iov, - offset, nr_segs, - ext4_get_block, NULL, NULL, 
0); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, + ext4_get_block, NULL, 0); + else + ret = __blockdev_direct_IO(rw, iocb, inode, + inode->i_sb->s_bdev, iov, offset, + nr_segs, ext4_get_block, NULL, NULL, 0); inode_dio_done(inode); } else { locked: - ret = blockdev_direct_IO(rw, iocb, inode, iov, - offset, nr_segs, ext4_get_block); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, + ext4_get_block, NULL, DIO_LOCKING); + else + ret = blockdev_direct_IO(rw, iocb, inode, iov, + offset, nr_segs, ext4_get_block); if (unlikely((rw & WRITE) && ret < 0)) { loff_t isize = i_size_read(inode); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index ce7341c..9462730 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3140,13 +3140,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb, get_block_func = ext4_get_block_write; dio_flags = DIO_LOCKING; } - ret = __blockdev_direct_IO(rw, iocb, inode, - inode->i_sb->s_bdev, iov, - offset, nr_segs, - get_block_func, - ext4_end_io_dio, - NULL, - dio_flags); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, + get_block_func, ext4_end_io_dio, dio_flags); + else + ret = __blockdev_direct_IO(rw, iocb, inode, + inode->i_sb->s_bdev, iov, offset, + nr_segs, get_block_func, + ext4_end_io_dio, NULL, dio_flags); /* * Put our reference to io_end. This can free the io_end structure e.g. @@ -3311,14 +3312,7 @@ void ext4_set_aops(struct inode *inode) inode->i_mapping->a_ops = &ext4_aops; } -/* - * ext4_block_zero_page_range() zeros out a mapping of length 'length' - * starting from file offset 'from'. The range to be zero'd must - * be contained with in one block. 
If the specified range exceeds - * the end of the block it will be shortened to end of the block - * that cooresponds to 'from' - */ -static int ext4_block_zero_page_range(handle_t *handle, +static int __ext4_block_zero_page_range(handle_t *handle, struct address_space *mapping, loff_t from, loff_t length) { ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT; @@ -3409,6 +3403,22 @@ unlock: } /* + * ext4_block_zero_page_range() zeros out a mapping of length 'length' + * starting from file offset 'from'. The range to be zero'd must + * be contained with in one block. If the specified range exceeds + * the end of the block it will be shortened to end of the block + * that cooresponds to 'from' + */ +static int ext4_block_zero_page_range(handle_t *handle, + struct address_space *mapping, loff_t from, loff_t length) +{ + struct inode *inode = mapping->host; + if (IS_DAX(inode)) + return dax_zero_page_range(inode, from, length, ext4_get_block); + return __ext4_block_zero_page_range(handle, mapping, from, length); +} + +/* * ext4_block_truncate_page() zeroes out a mapping from file offset `from' * up to the end of the block which corresponds to `from'. * This required during truncate. 
We need to physically zero the tail end @@ -3922,7 +3932,8 @@ void ext4_set_inode_flags(struct inode *inode) { unsigned int flags = EXT4_I(inode)->i_flags; - inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); + inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | + S_DIRSYNC | S_DAX); if (flags & EXT4_SYNC_FL) inode->i_flags |= S_SYNC; if (flags & EXT4_APPEND_FL) @@ -3933,6 +3944,8 @@ void ext4_set_inode_flags(struct inode *inode) inode->i_flags |= S_NOATIME; if (flags & EXT4_DIRSYNC_FL) inode->i_flags |= S_DIRSYNC; + if (test_opt(inode->i_sb, DAX)) + inode->i_flags |= S_DAX; } /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */ @@ -4184,7 +4197,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); } else if (S_ISDIR(inode->i_mode)) { inode->i_op = &ext4_dir_inode_operations; @@ -4640,7 +4656,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) * Truncate pagecache after we've waited for commit * in data=journal mode to make pages freeable. 
*/ - truncate_pagecache(inode, inode->i_size); + truncate_pagecache(inode, inode->i_size); } /* * We want to call ext4_truncate() even if attr->ia_size == diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index d050e04..acb9cca 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2249,7 +2249,10 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); err = ext4_add_nondir(handle, dentry, inode); if (!err && IS_DIRSYNC(dir)) @@ -2313,7 +2316,10 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); d_tmpfile(dentry, inode); err = ext4_orphan_add(handle, inode); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 710fed2..c0b7f4c 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1156,7 +1156,7 @@ enum { Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota, Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err, - Opt_usrquota, Opt_grpquota, Opt_i_version, + Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax, Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit, Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity, Opt_inode_readahead_blks, Opt_journal_ioprio, @@ -1218,6 +1218,7 @@ static const match_table_t tokens = { {Opt_barrier, "barrier"}, {Opt_nobarrier, "nobarrier"}, {Opt_i_version, "i_version"}, + {Opt_dax, "dax"}, {Opt_stripe, "stripe=%u"}, {Opt_delalloc, "delalloc"}, {Opt_nodelalloc, "nodelalloc"}, @@ -1400,6 +1401,7 @@ static const struct mount_opts { {Opt_min_batch_time, 0, MOPT_GTE0}, {Opt_inode_readahead_blks, 0, 
MOPT_GTE0}, {Opt_init_itable, 0, MOPT_GTE0}, + {Opt_dax, EXT4_MOUNT_DAX, MOPT_SET}, {Opt_stripe, 0, MOPT_GTE0}, {Opt_resuid, 0, MOPT_GTE0}, {Opt_resgid, 0, MOPT_GTE0}, @@ -1638,6 +1640,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token, } sbi->s_jquota_fmt = m->mount_opt; #endif +#ifndef CONFIG_FS_DAX + } else if (token == Opt_dax) { + ext4_msg(sb, KERN_INFO, "dax option not supported"); + return -1; +#endif } else { if (!args->from) arg = 1; @@ -3560,6 +3567,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) "both data=journal and dioread_nolock"); goto failed_mount; } + if (test_opt(sb, DAX)) { + ext4_msg(sb, KERN_ERR, "can't mount with " + "both data=journal and dax"); + goto failed_mount; + } if (test_opt(sb, DELALLOC)) clear_opt(sb, DELALLOC); } @@ -3613,6 +3625,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) goto failed_mount; } + if (sbi->s_mount_opt & EXT4_MOUNT_DAX) { + if (blocksize != PAGE_SIZE) { + ext4_msg(sb, KERN_ERR, + "error: unsupported blocksize for dax"); + goto failed_mount; + } + if (!sb->s_bdev->bd_disk->fops->direct_access) { + ext4_msg(sb, KERN_ERR, + "error: device does not support dax"); + goto failed_mount; + } + } + if (sb->s_blocksize != blocksize) { /* Validate the filesystem blocksize */ if (!sb_set_blocksize(sb, blocksize)) { @@ -4813,6 +4838,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) err = -EINVAL; goto restore_opts; } + if (test_opt(sb, DAX)) { + ext4_msg(sb, KERN_ERR, "can't mount with " + "both data=journal and dax"); + err = -EINVAL; + goto restore_opts; + } + } + + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) { + ext4_msg(sb, KERN_WARNING, "warning: refusing change of " + "dax flag with busy inodes while remounting"); + sbi->s_mount_opt ^= EXT4_MOUNT_DAX; } if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED) -- 1.9.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body 
to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 04/22] Change direct_access calling convention Date: Sat, 29 Mar 2014 17:30:28 +0100 Message-ID: <20140329163028.GD1211@quack.suse.cz> References: <214af2a38d840d0b8e983d39d03711d1292bc2d6.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <214af2a38d840d0b8e983d39d03711d1292bc2d6.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:30, Matthew Wilcox wrote: > In order to support accesses to larger chunks of memory, pass in a > 'size' parameter (counted in bytes), and return the amount available at > that address. > > Signed-off-by: Matthew Wilcox Two minor nits below. Other than that you can add: Reviewed-by: Jan Kara > --- > Documentation/filesystems/xip.txt | 15 +++++++++------ > arch/powerpc/sysdev/axonram.c | 6 +++--- > drivers/block/brd.c | 8 +++++--- > drivers/s390/block/dcssblk.c | 19 ++++++++++--------- > fs/ext2/xip.c | 30 +++++++++++++----------------- > include/linux/blkdev.h | 4 ++-- > 6 files changed, 42 insertions(+), 40 deletions(-) > ... 
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c > index e73b85c..00da60d 100644 > --- a/drivers/block/brd.c > +++ b/drivers/block/brd.c > @@ -361,8 +361,8 @@ out: > } > > #ifdef CONFIG_BLK_DEV_XIP > -static int brd_direct_access(struct block_device *bdev, sector_t sector, > - void **kaddr, unsigned long *pfn) > +static long brd_direct_access(struct block_device *bdev, sector_t sector, > + void **kaddr, unsigned long *pfn, long size) > { > struct brd_device *brd = bdev->bd_disk->private_data; > struct page *page; > @@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector, > *kaddr = page_address(page); > *pfn = page_to_pfn(page); > > - return 0; > + /* Could optimistically check to see if the next page in the > + * file is mapped to the next page of physical RAM */ > + return PAGE_SIZE; This should be min_t(long, PAGE_SIZE, size), shouldn't it? > } > #endif > > diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c > index ebf41e2..da914b2 100644 > --- a/drivers/s390/block/dcssblk.c > +++ b/drivers/s390/block/dcssblk.c > @@ -28,8 +28,8 @@ > static int dcssblk_open(struct block_device *bdev, fmode_t mode); > static void dcssblk_release(struct gendisk *disk, fmode_t mode); > static void dcssblk_make_request(struct request_queue *q, struct bio *bio); > -static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum, > - void **kaddr, unsigned long *pfn); > +static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum, > + void **kaddr, unsigned long *pfn, long size); > > static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0"; > > @@ -866,25 +866,26 @@ fail: > bio_io_error(bio); > } > > -static int > +static long > dcssblk_direct_access (struct block_device *bdev, sector_t secnum, > - void **kaddr, unsigned long *pfn) > + void **kaddr, unsigned long *pfn, long size) > { > struct dcssblk_dev_info *dev_info; > - unsigned long pgoff; > + unsigned long offset, dev_sz; > > dev_info = 
bdev->bd_disk->private_data; > if (!dev_info) > return -ENODEV; > + dev_sz = dev_info->end - dev_info->start; > if (secnum % (PAGE_SIZE/512)) > return -EINVAL; > - pgoff = secnum / (PAGE_SIZE / 512); > - if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start) > + offset = secnum * 512; > + if (offset > dev_sz) > return -ERANGE; > - *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE); > + *kaddr = (void *) (dev_info->start + offset); > *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; > > - return 0; > + return min_t(unsigned long, size, dev_sz - offset); ^^^ Why unsigned? Everything seems to be long... Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 03/22] axonram: Fix bug in direct_access Date: Sat, 29 Mar 2014 17:22:16 +0100 Message-ID: <20140329162216.GC1211@quack.suse.cz> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:29, Matthew Wilcox wrote: > The 'pfn' returned by axonram was completely bogus, and has been since > 2008. Maybe time to drop the driver instead? When no one noticed for 6 years, it seems pretty much dead... Or is there some possibility the driver can get reused for new HW?
Anyway the patch looks correct so feel free to add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > arch/powerpc/sysdev/axonram.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c > index 47b6b9f..830edc8 100644 > --- a/arch/powerpc/sysdev/axonram.c > +++ b/arch/powerpc/sysdev/axonram.c > @@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector, > } > > *kaddr = (void *)(bank->ph_addr + offset); > - *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT; > + *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; > > return 0; > } > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 01/22] Fix XIP fault vs truncate race Date: Sat, 29 Mar 2014 16:57:24 +0100 Message-ID: <20140329155724.GB1211@quack.suse.cz> References: <59d73a58d4cfbe190a16ce912bb2776d9cc95447.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <59d73a58d4cfbe190a16ce912bb2776d9cc95447.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:27, Matthew Wilcox wrote: > Pagecache faults recheck i_size after taking the page lock to ensure that > the fault didn't race against a truncate.
We don't have a page to lock > in the XIP case, so use the i_mmap_mutex instead. It is locked in the > truncate path in unmap_mapping_range() after updating i_size. So while > we hold it in the fault path, we are guaranteed that either i_size has > already been updated in the truncate path, or that the truncate will > subsequently call zap_page_range_single() and so remove the mapping we > have just inserted. > > There is a window of time in which i_size has been reduced and the > thread has a mapping to a page which will be removed from the file, > but this is harmless as the page will not be allocated to a different > purpose before the thread's access to it is revoked. The patch looks good. You can add: Reviewed-by: Jan Kara Honza > Signed-off-by: Matthew Wilcox > --- > mm/filemap_xip.c | 24 ++++++++++++++++++++++-- > 1 file changed, 22 insertions(+), 2 deletions(-) > > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c > index d8d9fe3..c8d23e9 100644 > --- a/mm/filemap_xip.c > +++ b/mm/filemap_xip.c > @@ -260,8 +260,17 @@ again: > __xip_unmap(mapping, vmf->pgoff); > > found: > + /* We must recheck i_size under i_mmap_mutex */ > + mutex_lock(&mapping->i_mmap_mutex); > + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> > + PAGE_CACHE_SHIFT; > + if (unlikely(vmf->pgoff >= size)) { > + mutex_unlock(&mapping->i_mmap_mutex); > + return VM_FAULT_SIGBUS; > + } > err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, > xip_pfn); > + mutex_unlock(&mapping->i_mmap_mutex); > if (err == -ENOMEM) > return VM_FAULT_OOM; > /* > @@ -285,16 +294,27 @@ found: > } > if (error != -ENODATA) > goto out; > + > + /* We must recheck i_size under i_mmap_mutex */ > + mutex_lock(&mapping->i_mmap_mutex); > + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> > + PAGE_CACHE_SHIFT; > + if (unlikely(vmf->pgoff >= size)) { > + ret = VM_FAULT_SIGBUS; > + goto unlock; > + } > /* not shared and writable, use xip_sparse_page() */ > page = xip_sparse_page(); > if (!page) > - goto out; > 
+ goto unlock; > err = vm_insert_page(vma, (unsigned long)vmf->virtual_address, > page); > if (err == -ENOMEM) > - goto out; > + goto unlock; > > ret = VM_FAULT_NOPAGE; > +unlock: > + mutex_unlock(&mapping->i_mmap_mutex); > out: > write_seqcount_end(&xip_sparse_seq); > mutex_unlock(&xip_sparse_mutex); > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 04/22] Change direct_access calling convention Date: Wed, 2 Apr 2014 15:27:59 -0400 Message-ID: <20140402192759.GD27299@linux.intel.com> References: <214af2a38d840d0b8e983d39d03711d1292bc2d6.1395591795.git.matthew.r.wilcox@intel.com> <20140329163028.GD1211@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Received: from mga02.intel.com ([134.134.136.20]:21929 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932993AbaDBTf4 (ORCPT ); Wed, 2 Apr 2014 15:35:56 -0400 Content-Disposition: inline In-Reply-To: <20140329163028.GD1211@quack.suse.cz> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Sat, Mar 29, 2014 at 05:30:28PM +0100, Jan Kara wrote: > > @@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector, > > *kaddr = page_address(page); > > *pfn = page_to_pfn(page); > > > > - return 0; > > + /* Could optimistically check to see if the next page in the > > + * file is mapped to the next page of physical RAM */ > > + return PAGE_SIZE; > This
should be min_t(long, PAGE_SIZE, size), shouldn't it? Yes, it should. In practice, I don't think anyone's calling it with size < PAGE_SIZE, but we might as well future-proof it. > > @@ -866,25 +866,26 @@ fail: > > bio_io_error(bio); > > } > > > > -static int > > +static long > > dcssblk_direct_access (struct block_device *bdev, sector_t secnum, > > - void **kaddr, unsigned long *pfn) > > + void **kaddr, unsigned long *pfn, long size) > > { > > struct dcssblk_dev_info *dev_info; > > - unsigned long pgoff; > > + unsigned long offset, dev_sz; > > - return 0; > > + return min_t(unsigned long, size, dev_sz - offset); > ^^^ Why unsigned? Everything seems to be long... offset is unsigned long ... but we might as well do the comparison as signed rather than unsigned. 'size' shouldn't be passed in as < 0 anyway. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 03/22] axonram: Fix bug in direct_access Date: Wed, 2 Apr 2014 15:24:46 -0400 Message-ID: <20140402192446.GC27299@linux.intel.com> References: <20140329162216.GC1211@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140329162216.GC1211@quack.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org On Sat, Mar 29, 2014 at 05:22:16PM +0100, Jan Kara wrote: > On Sun 23-03-14 15:08:29, Matthew Wilcox wrote: > > The 'pfn' returned by axonram was completely bogus, and has been since > > 2008. > Maybe time to drop the driver instead? When no one noticed for 6 years, it > seems pretty much dead... Or is there some possibility the driver can get > reused for new HW? It may be in use, just not with the -o xip option to ext2 ... I can't work out which of the various vendors on the internet called 'Axon' this device was originally supposed to support.
I suspect it's dead, since it's DDR-2, but *shrug*, it costs little to fix it. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 05/22] Introduce IS_DAX(inode) Date: Tue, 8 Apr 2014 17:32:55 +0200 Message-ID: <20140408153255.GC2713@quack.suse.cz> References: <6a8918c9a0fb37882179e3699b3e04d96540b24f.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <6a8918c9a0fb37882179e3699b3e04d96540b24f.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:31, Matthew Wilcox wrote: > Use an inode flag to tag inodes which should avoid using the page cache. > Convert ext2 to use it instead of mapping_is_xip(). The patch looks good. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > fs/ext2/inode.c | 9 ++++++--- > fs/ext2/xip.h | 2 -- > include/linux/fs.h | 6 ++++++ > 3 files changed, 12 insertions(+), 5 deletions(-) > > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 94ed3684..e7d3192 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode, > goto cleanup; > } > > - if (ext2_use_xip(inode->i_sb)) { > + if (IS_DAX(inode)) { > /* > * we need to clear the block > */ > @@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) > > inode_dio_wait(inode); > > - if (mapping_is_xip(inode->i_mapping)) > + if (IS_DAX(inode)) > error = xip_truncate_page(inode->i_mapping, newsize); > else if (test_opt(inode->i_sb, NOBH)) > error = nobh_truncate_page(inode->i_mapping, > @@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode) > { > unsigned int flags = EXT2_I(inode)->i_flags; > > - inode->i_flags &= 
~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); > + inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | > + S_DIRSYNC | S_DAX); > if (flags & EXT2_SYNC_FL) > inode->i_flags |= S_SYNC; > if (flags & EXT2_APPEND_FL) > @@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode) > inode->i_flags |= S_NOATIME; > if (flags & EXT2_DIRSYNC_FL) > inode->i_flags |= S_DIRSYNC; > + if (test_opt(inode->i_sb, XIP)) > + inode->i_flags |= S_DAX; > } > > /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */ > diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h > index 18b34d2..29be737 100644 > --- a/fs/ext2/xip.h > +++ b/fs/ext2/xip.h > @@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb) > } > int ext2_get_xip_mem(struct address_space *, pgoff_t, int, > void **, unsigned long *); > -#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem) > #else > -#define mapping_is_xip(map) 0 > #define ext2_xip_verify_sb(sb) do { } while (0) > #define ext2_use_xip(sb) 0 > #define ext2_clear_xip_target(inode, chain) 0 > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 23b2a35..47fd219 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1644,6 +1644,7 @@ struct super_operations { > #define S_IMA 1024 /* Inode has an associated IMA struct */ > #define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */ > #define S_NOSEC 4096 /* no suid or xattr security attributes */ > +#define S_DAX 8192 /* Direct Access, avoiding the page cache */ > > /* > * Note that nosuid etc flags are inode-specific: setting some file-system > @@ -1681,6 +1682,11 @@ struct super_operations { > #define IS_IMA(inode) ((inode)->i_flags & S_IMA) > #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) > #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) > +#ifdef CONFIG_FS_XIP > +#define IS_DAX(inode) ((inode)->i_flags & S_DAX) > +#else > +#define IS_DAX(inode) 0 > +#endif > > /* > * Inode state bits. 
Protected by inode->i_lock > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 02/22] Allow page fault handlers to perform the COW Date: Tue, 8 Apr 2014 18:34:57 +0200 Message-ID: <20140408163457.GD2713@quack.suse.cz> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:28, Matthew Wilcox wrote: > Currently COW of an XIP file is done by first bringing in a read-only > mapping, then retrying the fault and copying the page. It is much more > efficient to tell the fault handler that a COW is being attempted (by > passing in the pre-allocated page in the vm_fault structure), and allow > the handler to perform the COW operation itself. > > Where the filemap code protects against truncation of the file until > the PTE has been installed with the page lock, the XIP code uses the > i_mmap_mutex instead. We must therefore unlock the i_mmap_mutex after > inserting the PTE. Eww, leaking locking details about DAX into generic fault code is really ugly. It seems to me that once you pass the cow_page into the fault handler (which looks OK to me), you can just directly install it in the PTE via vm_insert_page() and you don't have to rely on do_cow_fault() for that. Thus you can return VM_FAULT_NOPAGE and be done with it?
Basically cow faults will then work the same way as other faults for DAX... Or am I missing something? Honza > Signed-off-by: Matthew Wilcox > --- > include/linux/mm.h | 2 ++ > mm/memory.c | 45 +++++++++++++++++++++++++++++++++------------ > 2 files changed, 35 insertions(+), 12 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index c1b7414..513b78a 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -205,6 +205,7 @@ struct vm_fault { > pgoff_t pgoff; /* Logical page offset based on vma */ > void __user *virtual_address; /* Faulting virtual address */ > > + struct page *cow_page; /* Handler may choose to COW */ > struct page *page; /* ->fault handlers should return a > * page here, unless VM_FAULT_NOPAGE > * is set (which is also implied by > @@ -1010,6 +1011,7 @@ static inline int page_mapped(struct page *page) > #define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned small page */ > #define VM_FAULT_HWPOISON_LARGE 0x0020 /* Hit poisoned large page. Index encoded in upper bits */ > > +#define VM_FAULT_COWED 0x0080 /* ->fault COWed the page instead */ > #define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */ > #define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */ > #define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */ > diff --git a/mm/memory.c b/mm/memory.c > index 07b4287..2a2ecac 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2602,6 +2602,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, > vmf.pgoff = page->index; > vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; > vmf.page = page; > + vmf.cow_page = NULL; > > ret = vma->vm_ops->page_mkwrite(vma, &vmf); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) > @@ -3288,7 +3289,8 @@ oom: > } > > static int __do_fault(struct vm_area_struct *vma, unsigned long address, > - pgoff_t pgoff, unsigned int flags, struct page **page) > + pgoff_t pgoff, unsigned int flags, > + struct page *cow_page, struct 
page **page) > { > struct vm_fault vmf; > int ret; > @@ -3297,10 +3299,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, > vmf.pgoff = pgoff; > vmf.flags = flags; > vmf.page = NULL; > + vmf.cow_page = cow_page; > > ret = vma->vm_ops->fault(vma, &vmf); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) > return ret; > + if (unlikely(ret & VM_FAULT_COWED)) > + goto out; > > if (unlikely(PageHWPoison(vmf.page))) { > if (ret & VM_FAULT_LOCKED) > @@ -3314,6 +3319,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, > else > VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page); > > + out: > *page = vmf.page; > return ret; > } > @@ -3351,7 +3357,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma, > pte_t *pte; > int ret; > > - ret = __do_fault(vma, address, pgoff, flags, &fault_page); > + ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) > return ret; > > @@ -3368,6 +3374,12 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma, > return ret; > } > > +/* > + * If the fault handler performs the COW, it does not return a page, > + * so cannot use the page's lock to protect against a concurrent truncate > + * operation. Instead it returns with the i_mmap_mutex held, which must > + * be released after the PTE has been inserted. 
> + */ > static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, pmd_t *pmd, > pgoff_t pgoff, unsigned int flags, pte_t orig_pte) > @@ -3389,25 +3401,34 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, > return VM_FAULT_OOM; > } > > - ret = __do_fault(vma, address, pgoff, flags, &fault_page); > + ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) > goto uncharge_out; > > - copy_user_highpage(new_page, fault_page, address, vma); > + if (!(ret & VM_FAULT_COWED)) > + copy_user_highpage(new_page, fault_page, address, vma); > __SetPageUptodate(new_page); > > pte = pte_offset_map_lock(mm, pmd, address, &ptl); > - if (unlikely(!pte_same(*pte, orig_pte))) { > - pte_unmap_unlock(pte, ptl); > + if (unlikely(!pte_same(*pte, orig_pte))) > + goto unlock_out; > + do_set_pte(vma, address, new_page, pte, true, true); > + pte_unmap_unlock(pte, ptl); > + if (ret & VM_FAULT_COWED) { > + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); > + } else { > unlock_page(fault_page); > page_cache_release(fault_page); > - goto uncharge_out; > } > - do_set_pte(vma, address, new_page, pte, true, true); > - pte_unmap_unlock(pte, ptl); > - unlock_page(fault_page); > - page_cache_release(fault_page); > return ret; > +unlock_out: > + pte_unmap_unlock(pte, ptl); > + if (ret & VM_FAULT_COWED) { > + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); > + } else { > + unlock_page(fault_page); > + page_cache_release(fault_page); > + } > uncharge_out: > mem_cgroup_uncharge_page(new_page); > page_cache_release(new_page); > @@ -3424,7 +3445,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma, > int dirtied = 0; > int ret, tmp; > > - ret = __do_fault(vma, address, pgoff, flags, &fault_page); > + ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE 
| VM_FAULT_RETRY))) > return ret; > > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Date: Tue, 8 Apr 2014 19:56:00 +0200 Message-ID: <20140408175600.GE2713@quack.suse.cz> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:32, Matthew Wilcox wrote: > Use the generic AIO infrastructure instead of custom read and write > methods. In addition to giving us support for AIO, this adds the missing > locking between read() and truncate(). > > Signed-off-by: Matthew Wilcox > Reviewed-by: Ross Zwisler In general this looks fine but I have some comments below. 
> --- > fs/Makefile | 1 + > fs/dax.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++ > fs/ext2/file.c | 6 +- > fs/ext2/inode.c | 7 +- > include/linux/fs.h | 18 ++++- > mm/filemap.c | 6 +- > mm/filemap_xip.c | 234 ----------------------------------------------------- > 7 files changed, 243 insertions(+), 245 deletions(-) > create mode 100644 fs/dax.c > > diff --git a/fs/Makefile b/fs/Makefile > index 47ac07b..2f194cd 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o > obj-$(CONFIG_TIMERFD) += timerfd.o > obj-$(CONFIG_EVENTFD) += eventfd.o > obj-$(CONFIG_AIO) += aio.o > +obj-$(CONFIG_FS_XIP) += dax.o > obj-$(CONFIG_FILE_LOCKING) += locks.o > obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o > obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o > diff --git a/fs/dax.c b/fs/dax.c > new file mode 100644 > index 0000000..66a6bda > --- /dev/null > +++ b/fs/dax.c > @@ -0,0 +1,216 @@ > +/* > + * fs/dax.c - Direct Access filesystem code > + * Copyright (c) 2013-2014 Intel Corporation > + * Author: Matthew Wilcox > + * Author: Ross Zwisler > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. 
> + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +static long dax_get_addr(struct inode *inode, struct buffer_head *bh, > + void **addr) > +{ > + struct block_device *bdev = bh->b_bdev; > + const struct block_device_operations *ops = bdev->bd_disk->fops; > + unsigned long pfn; > + sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); > + return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size); > +} > + > +static void dax_new_buf(void *addr, unsigned size, unsigned first, > + loff_t offset, loff_t end, int rw) > +{ > + loff_t final = end - offset + first; /* The final byte of the buffer */ > + if (rw != WRITE) { > + memset(addr, 0, size); > + return; > + } It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do this for unwritten blocks) when reading from them. Presumably it could also have undesired effects on endurance of persistent memory. Instead I'd expect that you simply zero out user provided buffer the same way as you do it for holes. > + > + if (first > 0) > + memset(addr, 0, first); > + if (final < size) > + memset(addr + final, 0, size - final); > +} > + > +static bool buffer_written(struct buffer_head *bh) > +{ > + return buffer_mapped(bh) && !buffer_unwritten(bh); > +} > + > +/* > + * When ext4 encounters a hole, it likes to return without modifying the > + * buffer_head which means that we can't trust b_size. To cope with this, > + * we set b_state to 0 before calling get_block and, if any bit is set, we > + * know we can trust b_size. Unfortunate, really, since ext4 does know > + * precisely how long a hole is and would save us time calling get_block > + * repeatedly. Well, this is really a problem of get_blocks() returning the result in struct buffer_head which is used for input as well. I don't think it is actually ext4 specific. 
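The b_state check discussed in the comment above comes down to an in/out parameter problem: b_size carries the caller's request on the way in and the mapped length on the way out, so a buffer_head that the callee never touched looks the same as a full-length answer. The following is a minimal userspace sketch of that idea, not kernel code. The names struct mock_bh, MOCK_BH_MAPPED, mock_get_block and mock_map_extent are hypothetical, and the layout (blocks 0..3 mapped, everything else a hole) is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct buffer_head. */
struct mock_bh {
	unsigned long state;	/* flag bits; 0 means "callee set nothing" */
	size_t size;		/* in: bytes requested; out: bytes mapped */
	unsigned long blocknr;	/* out: first mapped block */
};

#define MOCK_BH_MAPPED 0x1UL

/*
 * Hypothetical get_block: blocks 0..3 form one mapped extent, everything
 * beyond that is a hole.  On a hole it returns 0 without touching *bh,
 * which is the behaviour the quoted comment complains about.
 */
static int mock_get_block(unsigned long block, struct mock_bh *bh,
			  size_t blksize)
{
	size_t avail;

	if (block >= 4)
		return 0;		/* hole: bh left unmodified */

	avail = (4 - block) * blksize;
	bh->blocknr = 100 + block;
	bh->state |= MOCK_BH_MAPPED;
	if (bh->size > avail)
		bh->size = avail;	/* trim the request to the extent */
	return 0;
}

/*
 * Caller-side pattern: clear the state, call get_block, and trust the
 * size only if some state bit came back set.  Returns the number of
 * bytes the caller may treat as one unit.
 */
static size_t mock_map_extent(unsigned long block, size_t want,
			      size_t blksize, int *is_hole)
{
	struct mock_bh bh = { .state = 0, .size = want, .blocknr = 0 };

	mock_get_block(block, &bh, blksize);
	if (bh.state == 0) {
		/* size is still our own request, not an answer */
		*is_hole = 1;
		return blksize;		/* fall back to a single block */
	}
	*is_hole = 0;
	return bh.size;
}
```

In this sketch a hole costs one get_block call per block, which is the inefficiency the quoted comment regrets. A separate output field for the length would avoid it, in line with the observation that the shared input/output structure is the root of the problem.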
> + */ > +static bool buffer_size_valid(struct buffer_head *bh) > +{ > + return bh->b_state != 0; > +} > + > +static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov, > + loff_t start, loff_t end, get_block_t get_block, > + struct buffer_head *bh) > +{ > + ssize_t retval = 0; > + unsigned seg = 0; > + unsigned len; > + unsigned copied = 0; > + loff_t offset = start; > + loff_t max = start; > + loff_t bh_max = start; > + void *addr; > + bool hole = false; > + > + if (rw != WRITE) > + end = min(end, i_size_read(inode)); > + > + while (offset < end) { > + void __user *buf = iov[seg].iov_base + copied; > + > + if (offset == max) { > + sector_t block = offset >> inode->i_blkbits; > + unsigned first = offset - (block << inode->i_blkbits); > + long size; > + > + if (offset == bh_max) { > + bh->b_size = PAGE_ALIGN(end - offset); > + bh->b_state = 0; > + retval = get_block(inode, block, bh, > + rw == WRITE); > + if (retval) > + break; > + if (!buffer_size_valid(bh)) > + bh->b_size = 1 << inode->i_blkbits; > + bh_max = offset - first + bh->b_size; > + } else { > + unsigned done = bh->b_size - (bh_max - > + (offset - first)); > + bh->b_blocknr += done >> inode->i_blkbits; > + bh->b_size -= done; It took me quite some time to figure out what this does and whether it is correct :). Why isn't this at the place where we advance all other iterators like offset, addr, etc.? > + } > + if (rw == WRITE) { > + if (!buffer_mapped(bh)) { > + retval = -EIO; > + break; -EIO looks like a wrong error here. Or maybe it is the right one and it only needs some explanation? The thing is that for direct IO some filesystems choose not to fill holes for direct IO and fall back to buffered IO instead (to avoid exposure of uninitialized blocks if the system crashes after blocks have been added to a file but before they were written out). 
For DAX you are pretty much free to define what you ask from the get_blocks() (and this fallback behavior is somewhat disputed behavior in direct IO case so you might want to differ here) but you should document it somewhere. > + } > + hole = false; > + } else { > + hole = !buffer_written(bh); > + } > + > + if (hole) { > + addr = NULL; > + size = bh->b_size - first; > + } else { > + retval = dax_get_addr(inode, bh, &addr); > + if (retval < 0) > + break; > + if (buffer_unwritten(bh) || buffer_new(bh)) > + dax_new_buf(addr, retval, first, > + offset, end, rw); > + addr += first; > + size = retval - first; > + } > + max = min(offset + size, end); > + } > + > + len = min_t(unsigned, iov[seg].iov_len - copied, max - offset); > + > + if (rw == WRITE) > + len -= __copy_from_user_nocache(addr, buf, len); > + else if (!hole) > + len -= __copy_to_user(buf, addr, len); > + else > + len -= __clear_user(buf, len); > + > + if (!len) > + break; > + > + offset += len; > + copied += len; > + addr += len; > + if (copied == iov[seg].iov_len) { > + seg++; > + copied = 0; > + } > + } > + > + return (offset == start) ? retval : offset - start; > +} > + > +/** > + * dax_do_io - Perform I/O to a DAX file > + * @rw: READ to read or WRITE to write > + * @iocb: The control block for this I/O > + * @inode: The file which the I/O is directed at > + * @iov: The user addresses to do I/O from or to > + * @offset: The file offset where the I/O starts > + * @nr_segs: The length of the iov array > + * @get_block: The filesystem method used to translate file offsets to blocks > + * @end_io: A filesystem callback for I/O completion > + * @flags: See below > + * > + * This function uses the same locking scheme as do_blockdev_direct_IO: > + * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the > + * caller for writes. For reads, we take and release the i_mutex ourselves. > + * If DIO_LOCKING is not set, the filesystem takes care of its own locking. 
> + * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O > + * is in progress. > + */ > +ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, > + const struct iovec *iov, loff_t offset, unsigned nr_segs, > + get_block_t get_block, dio_iodone_t end_io, int flags) > +{ > + struct buffer_head bh; > + unsigned seg; > + ssize_t retval = -EINVAL; > + loff_t end = offset; > + > + memset(&bh, 0, sizeof(bh)); > + for (seg = 0; seg < nr_segs; seg++) > + end += iov[seg].iov_len; > + > + if ((flags & DIO_LOCKING) && (rw == READ)) { > + struct address_space *mapping = inode->i_mapping; > + mutex_lock(&inode->i_mutex); > + retval = filemap_write_and_wait_range(mapping, offset, end - 1); > + if (retval) { > + mutex_unlock(&inode->i_mutex); > + goto out; > + } Is there a reason for this? I'd assume DAX has no pages in pagecache... > + } > + > + /* Protects against truncate */ > + atomic_inc(&inode->i_dio_count); > + > + retval = dax_io(rw, inode, iov, offset, end, get_block, &bh); > + > + if ((flags & DIO_LOCKING) && (rw == READ)) > + mutex_unlock(&inode->i_mutex); > + > + inode_dio_done(inode); > + > + if ((retval > 0) && end_io) > + end_io(iocb, offset, retval, bh.b_private); > + out: > + return retval; > +} > +EXPORT_SYMBOL_GPL(dax_do_io); > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 44c36e5..ef5cf96 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = { > #ifdef CONFIG_EXT2_FS_XIP > const struct file_operations ext2_xip_file_operations = { > .llseek = generic_file_llseek, > - .read = xip_file_read, > - .write = xip_file_write, > + .read = do_sync_read, > + .write = do_sync_write, > + .aio_read = generic_file_aio_read, > + .aio_write = generic_file_aio_write, > .unlocked_ioctl = ext2_ioctl, > #ifdef CONFIG_COMPAT > .compat_ioctl = ext2_compat_ioctl, > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index e7d3192..f128ebf 100644 > --- a/fs/ext2/inode.c > +++ 
b/fs/ext2/inode.c > @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, > struct inode *inode = mapping->host; > ssize_t ret; > > - ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, > + if (IS_DAX(inode)) > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > + ext2_get_block, NULL, DIO_LOCKING); > + else > + ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, > ext2_get_block); I'd somewhat prefer to keep ext2_direct_IO() as is and have a separate ext2_dax_IO() that calls only dax_do_io() (and use that as .direct_IO in ext2_aops_xip). Then there's no need to check IS_DAX() and the code would look more obvious to me. But I don't feel strongly about it. > if (ret < 0 && (rw & WRITE)) > ext2_write_failed(mapping, offset + iov_length(iov, nr_segs)); > @@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = { > const struct address_space_operations ext2_aops_xip = { > .bmap = ext2_bmap, > .get_xip_mem = ext2_get_xip_mem, > + .direct_IO = ext2_direct_IO, > }; > > const struct address_space_operations ext2_nobh_aops = { > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 47fd219..dabc601 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2521,17 +2521,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp); > extern int nonseekable_open(struct inode * inode, struct file * filp); > > #ifdef CONFIG_FS_XIP > -extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len, > - loff_t *ppos); > extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma); > -extern ssize_t xip_file_write(struct file *filp, const char __user *buf, > - size_t len, loff_t *ppos); > extern int xip_truncate_page(struct address_space *mapping, loff_t from); > +ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, > + loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags); > #else > static inline int xip_truncate_page(struct
address_space *mapping, loff_t from) > { > return 0; > } > + > +static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, > + const struct iovec *iov, loff_t offset, unsigned nr_segs, > + get_block_t get_block, dio_iodone_t end_io, int flags) > +{ > + return -ENOTTY; Huh, ENOTTY? I'd expect EOPNOTSUPP or something like that... > +} > #endif > > #ifdef CONFIG_BLOCK > @@ -2681,6 +2686,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root); > extern void save_mount_options(struct super_block *sb, char *options); > extern void replace_mount_options(struct super_block *sb, char *options); > > +static inline bool io_is_direct(struct file *filp) > +{ > + return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp)); > +} > + BTW: It seems fs/open.c: open_check_o_direct() can be simplified to not check for get_xip_mem(), cannot it? > static inline ino_t parent_ino(struct dentry *dentry) > { > ino_t res; > diff --git a/mm/filemap.c b/mm/filemap.c > index 7a13f6a..1b7dff6 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, > if (retval) > return retval; > > - /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ > - if (filp->f_flags & O_DIRECT) { > + if (io_is_direct(filp)) { > loff_t size; > struct address_space *mapping; > struct inode *inode; > @@ -2468,8 +2467,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, > if (err) > goto out; > > - /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ > - if (unlikely(file->f_flags & O_DIRECT)) { > + if (io_is_direct(file)) { > loff_t endbyte; > ssize_t written_buffered; > Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 10/22] Remove get_xip_mem Date: Tue, 8 Apr 2014 20:20:59 +0200 Message-ID: <20140408182059.GA26019@quack.suse.cz> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:36, Matthew Wilcox wrote: > All callers of get_xip_mem() are now gone. Remove checks for it, > initialisers of it, documentation of it and the only implementation of it. > > Add documentation for writing a filesystem that supports DAX. > > Signed-off-by: Matthew Wilcox > Reviewed-by: Randy Dunlap The patch looks good. You can add: Reviewed-by: Jan Kara Honza > --- > Documentation/filesystems/Locking | 3 -- > Documentation/filesystems/dax.txt | 82 +++++++++++++++++++++++++++++++++++++++ > Documentation/filesystems/xip.txt | 71 --------------------------------- > fs/exofs/inode.c | 1 - > fs/ext2/inode.c | 1 - > fs/ext2/xip.c | 37 ------------------ > fs/ext2/xip.h | 3 -- > fs/open.c | 5 +-- > include/linux/fs.h | 2 - > mm/fadvise.c | 6 ++- > mm/madvise.c | 2 +- > 11 files changed, 88 insertions(+), 125 deletions(-) > create mode 100644 Documentation/filesystems/dax.txt > delete mode 100644 Documentation/filesystems/xip.txt > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking > index 5b0c083..2780d47 100644 > --- a/Documentation/filesystems/Locking > +++ b/Documentation/filesystems/Locking > @@ -194,8 +194,6 @@ prototypes: > void (*freepage)(struct page *); > int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > - int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, > - unsigned long *); > int (*migratepage)(struct 
address_space *, struct page *, struct page *); > int (*launder_page)(struct page *); > int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long); > @@ -220,7 +218,6 @@ invalidatepage: yes > releasepage: yes > freepage: yes > direct_IO: > -get_xip_mem: maybe > migratepage: yes (both) > launder_page: yes > is_partially_uptodate: yes > diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt > new file mode 100644 > index 0000000..06f84e5 > --- /dev/null > +++ b/Documentation/filesystems/dax.txt > @@ -0,0 +1,82 @@ > +Execute-in-place for file mappings > +---------------------------------- > + > +Motivation > +---------- > + > +File mappings are usually performed by mapping page cache pages to > +userspace. In addition, read & write file operations also transfer data > +between the page cache and storage. > + > +For memory backed storage devices that use the block device interface, > +the page cache pages are just copies of the original storage. The > +execute-in-place code removes the extra copy by performing reads and > +writes directly on the memory backed storage device. For file mappings, > +the storage device itself is mapped directly into userspace. > + > + > +Implementation Tips for Block Driver Writers > +-------------------------------------------- > + > +To support DAX in your block driver, implement the 'direct_access' > +block device operation. It is used to translate the sector number > +(expressed in units of 512-byte sectors) to a page frame number (pfn) > +that identifies the physical page for the memory. It also returns a > +kernel virtual address that can be used to access the memory. > + > +The direct_access method takes a 'size' parameter that indicates the > +number of bytes being requested. The function should return the number > +of bytes that it can provide, although it must not exceed the number of > +bytes requested. It may also return a negative errno if an error occurs. 
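The return-value contract described in the quoted documentation (return the number of bytes available, never more than requested, or a negative errno) can be illustrated with a small userspace mock. This is a sketch under stated assumptions, not the driver API: mock_direct_access, mock_ram and the MOCK_* constants are hypothetical names, the device is a 4 KiB static array, the pfn is faked from the byte offset, and the particular errno values are chosen only for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_SECTOR_SIZE 512UL
#define MOCK_DEV_BYTES   (8 * MOCK_SECTOR_SIZE)	/* 4 KiB mock device */
#define MOCK_PAGE_SHIFT  12

/* Hypothetical memory-backed "device": plain static storage. */
static unsigned char mock_ram[MOCK_DEV_BYTES];

/*
 * Mock of the direct_access contract: translate a 512-byte sector number
 * into an address and a (fake) pfn, and return how many bytes are usable
 * from that point, capped at the number of bytes requested.
 */
static long mock_direct_access(unsigned long sector, void **addr,
			       unsigned long *pfn, long size)
{
	unsigned long offset = sector * MOCK_SECTOR_SIZE;
	long avail;

	if (size <= 0)
		return -EINVAL;		/* assumption: reject empty requests */
	if (offset >= MOCK_DEV_BYTES)
		return -ERANGE;		/* assumption: sector past the end */

	avail = (long)(MOCK_DEV_BYTES - offset);
	*addr = mock_ram + offset;
	/* Fake pfn: a real driver would derive it from the physical address. */
	*pfn = offset >> MOCK_PAGE_SHIFT;

	return size < avail ? size : avail;
}
```

A caller that asks for more than the device can provide from a given sector has to loop on the shorter return value, which is what the quoted dax_io() does by recomputing its limit from the value returned by dax_get_addr().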
> + > +In order to support this method, the storage must be byte-accessible by > +the CPU at all times. If your device uses paging techniques to expose > +a large amount of memory through a smaller window, then you cannot > +implement direct_access. Equally, if your device can occasionally > +stall the CPU for an extended period, you should also not attempt to > +implement direct_access. > + > +These block devices may be used for inspiration: > +- axonram: Axon DDR2 device driver > +- brd: RAM backed block device driver > +- dcssblk: s390 dcss block device driver > + > + > +Implementation Tips for Filesystem Writers > +------------------------------------------ > + > +Filesystem support consists of > +- adding support to mark inodes as being DAX by setting the S_DAX flag in > + i_flags > +- implementing the direct_IO address space operation, and calling > + dax_do_io() instead of blockdev_direct_IO() if S_DAX is set > +- implementing an mmap file operation for DAX files which sets the > + VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers > + for fault and page_mkwrite (which should probably call dax_fault() and > + dax_mkwrite(), passing the appropriate get_block() callback) > +- calling dax_truncate_page() instead of block_truncate_page() for DAX files > +- ensuring that there is sufficient locking between reads, writes, > + truncates and page faults > + > +The get_block() callback passed to the DAX functions may return > +uninitialised extents. If it does, it must ensure that simultaneous > +calls to get_block() (for example by a page-fault racing with a read() > +or a write()) work correctly. > + > +These filesystems may be used for inspiration: > +- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt > + > + > +Shortcomings > +------------ > + > +Even if the kernel or its modules are stored on a filesystem that supports > +DAX on a block device that supports DAX, they will still be copied into RAM. 
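The first two filesystem-side steps listed in the quoted documentation (mark the inode with S_DAX, then pick dax_do_io() over blockdev_direct_IO() in direct_IO) can be sketched in userspace as follows. This is an illustration only: struct mock_inode, mock_set_inode_flags and mock_direct_io are hypothetical names, the two I/O paths are reduced to an enum, and only the flag value 8192 is taken from the quoted fs.h hunk.

```c
#include <assert.h>

#define MOCK_S_DAX 8192u	/* same bit value as S_DAX in the quoted hunk */

/* Hypothetical stand-in for struct inode; only i_flags matters here. */
struct mock_inode {
	unsigned int i_flags;
};

#define MOCK_IS_DAX(inode) ((inode)->i_flags & MOCK_S_DAX)

enum mock_io_path {
	MOCK_PATH_BLOCKDEV = 0,	/* where blockdev_direct_IO() would run */
	MOCK_PATH_DAX = 1	/* where dax_do_io() would run */
};

/*
 * Mirrors the shape of the quoted ext2_set_inode_flags() change: clear
 * the flag first, then set it again if the mount option asks for it.
 */
static void mock_set_inode_flags(struct mock_inode *inode, int mounted_with_xip)
{
	inode->i_flags &= ~MOCK_S_DAX;
	if (mounted_with_xip)
		inode->i_flags |= MOCK_S_DAX;
}

/* Mirrors the dispatch in the quoted ext2_direct_IO() change. */
static enum mock_io_path mock_direct_io(const struct mock_inode *inode)
{
	if (MOCK_IS_DAX(inode))
		return MOCK_PATH_DAX;
	return MOCK_PATH_BLOCKDEV;
}
```

The review elsewhere in this thread suggests a separate direct_IO method for DAX inodes instead of the runtime check. In this sketch that would amount to choosing between two functions when the inode is set up, rather than testing the flag on every call.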
> + > +Calling get_user_pages() on a range of user memory that has been mmaped > +from a DAX file will fail as there are no 'struct page' to describe > +those pages. This problem is being worked on. That means that O_DIRECT > +reads/writes to those memory ranges from a non-DAX file will fail (note > +that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory > +that is being accessed that is key here). Other things that will not > +work include RDMA, sendfile() and splice(). > diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt > deleted file mode 100644 > index b62eabf..0000000 > --- a/Documentation/filesystems/xip.txt > +++ /dev/null > @@ -1,71 +0,0 @@ > -Execute-in-place for file mappings > ----------------------------------- > - > -Motivation > ----------- > -File mappings are performed by mapping page cache pages to userspace. In > -addition, read&write type file operations also transfer data from/to the page > -cache. > - > -For memory backed storage devices that use the block device interface, the page > -cache pages are in fact copies of the original storage. Various approaches > -exist to work around the need for an extra copy. The ramdisk driver for example > -does read the data into the page cache, keeps a reference, and discards the > -original data behind later on. > - > -Execute-in-place solves this issue the other way around: instead of keeping > -data in the page cache, the need to have a page cache copy is eliminated > -completely. With execute-in-place, read&write type operations are performed > -directly from/to the memory backed storage device. For file mappings, the > -storage device itself is mapped directly into userspace. > - > -This implementation was initially written for shared memory segments between > -different virtual machines on s390 hardware to allow multiple machines to > -share the same binaries and libraries. 
> - > -Implementation > --------------- > -Execute-in-place is implemented in three steps: block device operation, > -address space operation, and file operations. > - > -A block device operation named direct_access is used to translate the > -block device sector number to a page frame number (pfn) that identifies > -the physical page for the memory. It also returns a kernel virtual > -address that can be used to access the memory. > - > -The direct_access method takes a 'size' parameter that indicates the > -number of bytes being requested. The function should return the number > -of bytes that it can provide, although it must not exceed the number of > -bytes requested. It may also return a negative errno if an error occurs. > - > -The block device operation is optional, these block devices support it as of > -today: > -- dcssblk: s390 dcss block device driver > - > -An address space operation named get_xip_mem is used to retrieve references > -to a page frame number and a kernel address. To obtain these values a reference > -to an address_space is provided. This function assigns values to the kmem and > -pfn parameters. The third argument indicates whether the function should allocate > -blocks if needed. > - > -This address space operation is mutually exclusive with readpage&writepage that > -do page cache read/write operations. > -The following filesystems support it as of today: > -- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt > - > -A set of file operations that do utilize get_xip_page can be found in > -mm/filemap_xip.c . The following file operation implementations are provided: > -- aio_read/aio_write > -- readv/writev > -- sendfile > - > -The generic file operations do_sync_read/do_sync_write can be used to implement > -classic synchronous IO calls. > - > -Shortcomings > ------------- > -This implementation is limited to storage devices that are cpu addressable at > -all times (no highmem or such). 
It works well on rom/ram, but enhancements are > -needed to make it work with flash in read+write mode. > -Putting the Linux kernel and/or its modules on a xip filesystem does not mean > -they are not copied. > diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c > index ee4317fa..f9a5bf6 100644 > --- a/fs/exofs/inode.c > +++ b/fs/exofs/inode.c > @@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = { > .direct_IO = exofs_direct_IO, > > /* With these NULL has special meaning or default is not exported */ > - .get_xip_mem = NULL, > .migratepage = NULL, > .launder_page = NULL, > .is_partially_uptodate = NULL, > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 252481f..b156fe8 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = { > > const struct address_space_operations ext2_aops_xip = { > .bmap = ext2_bmap, > - .get_xip_mem = ext2_get_xip_mem, > .direct_IO = ext2_direct_IO, > }; > > diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c > index fa40091..ca745ff 100644 > --- a/fs/ext2/xip.c > +++ b/fs/ext2/xip.c > @@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block, > return ops->direct_access(bdev, sector, kaddr, pfn, size); > } > > -static inline int > -__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create, > - sector_t *result) > -{ > - struct buffer_head tmp; > - int rc; > - > - memset(&tmp, 0, sizeof(struct buffer_head)); > - tmp.b_size = 1 << inode->i_blkbits; > - rc = ext2_get_block(inode, pgoff, &tmp, create); > - *result = tmp.b_blocknr; > - > - /* did we get a sparse block (hole in the file)? 
*/ > - if (!tmp.b_blocknr && !rc) { > - BUG_ON(create); > - rc = -ENODATA; > - } > - > - return rc; > -} > - > int > ext2_clear_xip_target(struct inode *inode, sector_t block) > { > @@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb) > "not supported by bdev"); > } > } > - > -int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, > - void **kmem, unsigned long *pfn) > -{ > - long rc; > - sector_t block; > - > - /* first, retrieve the sector number */ > - rc = __ext2_get_block(mapping->host, pgoff, create, &block); > - if (rc) > - return rc; > - > - /* retrieve address of the target data */ > - rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE); > - return (rc < 0) ? rc : 0; > -} > diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h > index 29be737..0fa8b7f 100644 > --- a/fs/ext2/xip.h > +++ b/fs/ext2/xip.h > @@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb) > struct ext2_sb_info *sbi = EXT2_SB(sb); > return (sbi->s_mount_opt & EXT2_MOUNT_XIP); > } > -int ext2_get_xip_mem(struct address_space *, pgoff_t, int, > - void **, unsigned long *); > #else > #define ext2_xip_verify_sb(sb) do { } while (0) > #define ext2_use_xip(sb) 0 > #define ext2_clear_xip_target(inode, chain) 0 > -#define ext2_get_xip_mem NULL > #endif > diff --git a/fs/open.c b/fs/open.c > index b9ed8b2..bc9f002 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f) > { > /* NB: we're sure to have correct a_ops only after f_op->open */ > if (f->f_flags & O_DIRECT) { > - if (!f->f_mapping->a_ops || > - ((!f->f_mapping->a_ops->direct_IO) && > - (!f->f_mapping->a_ops->get_xip_mem))) { > + if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO) > return -EINVAL; > - } > } > return 0; > } > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9752ae5..c777056 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -375,8 +375,6 @@ struct address_space_operations 
{ > void (*freepage)(struct page *); > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > - int (*get_xip_mem)(struct address_space *, pgoff_t, int, > - void **, unsigned long *); > /* > * migrate the contents of a page to the specified target. If > * migrate_mode is MIGRATE_ASYNC, it must not block. > diff --git a/mm/fadvise.c b/mm/fadvise.c > index 3bcfd81..1f1925f 100644 > --- a/mm/fadvise.c > +++ b/mm/fadvise.c > @@ -28,6 +28,7 @@ > SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) > { > struct fd f = fdget(fd); > + struct inode *inode; > struct address_space *mapping; > struct backing_dev_info *bdi; > loff_t endbyte; /* inclusive */ > @@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) > if (!f.file) > return -EBADF; > > - if (S_ISFIFO(file_inode(f.file)->i_mode)) { > + inode = file_inode(f.file); > + if (S_ISFIFO(inode->i_mode)) { > ret = -ESPIPE; > goto out; > } > @@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) > goto out; > } > > - if (mapping->a_ops->get_xip_mem) { > + if (IS_DAX(inode)) { > switch (advice) { > case POSIX_FADV_NORMAL: > case POSIX_FADV_RANDOM: > diff --git a/mm/madvise.c b/mm/madvise.c > index 539eeb9..b6a2f52 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma, > if (!file) > return -EBADF; > > - if (file->f_mapping->a_ops->get_xip_mem) { > + if (IS_DAX(file_inode(file))) { > /* no bad return value, but ignore advice */ > return 0; > } > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 09/22] Remove mm/filemap_xip.c Date: Tue, 8 Apr 2014 20:21:35 +0200 Message-ID: <20140408182135.GB26019@quack.suse.cz> References: <69ab315f0124881ae74d9881c48c7bdc70368fd1.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <69ab315f0124881ae74d9881c48c7bdc70368fd1.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:35, Matthew Wilcox wrote: > It is now empty as all of its contents have been replaced by fs/xip.c Looks good. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > mm/Makefile | 1 - > mm/filemap_xip.c | 23 ----------------------- > 2 files changed, 24 deletions(-) > delete mode 100644 mm/filemap_xip.c > > diff --git a/mm/Makefile b/mm/Makefile > index 310c90a..454c176 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o > obj-$(CONFIG_KMEMCHECK) += kmemcheck.o > obj-$(CONFIG_FAILSLAB) += failslab.o > obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o > -obj-$(CONFIG_FS_XIP) += filemap_xip.o > obj-$(CONFIG_MIGRATION) += migrate.o > obj-$(CONFIG_QUICKLIST) += quicklist.o > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c > deleted file mode 100644 > index 6316578..0000000 > --- a/mm/filemap_xip.c > +++ /dev/null > @@ -1,23 +0,0 @@ > -/* > - * linux/mm/filemap_xip.c > - * > - * Copyright (C) 2005 IBM Corporation > - * Author: Carsten Otte > - * > - * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds > - * > - */ > - > -#include > -#include > 
-#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > - > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Wed, 9 Apr 2014 00:05:25 +0200 Message-ID: <20140408220525.GC26019@quack.suse.cz> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:33, Matthew Wilcox wrote: > Instead of calling aops->get_xip_mem from the fault handler, the > filesystem passes a get_block_t that is used to find the appropriate > blocks. I have some suggestions below... 
> Signed-off-by: Matthew Wilcox > --- > fs/dax.c | 207 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > fs/ext2/file.c | 35 ++++++++- > include/linux/fs.h | 4 +- > mm/filemap_xip.c | 206 ---------------------------------------------------- > 4 files changed, 243 insertions(+), 209 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 66a6bda..863749c 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -19,8 +19,12 @@ > #include > #include > #include > +#include > +#include > +#include > #include > #include > +#include > > static long dax_get_addr(struct inode *inode, struct buffer_head *bh, > void **addr) > @@ -32,6 +36,16 @@ static long dax_get_addr(struct inode *inode, struct buffer_head *bh, > return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size); > } > > +static long dax_get_pfn(struct inode *inode, struct buffer_head *bh, > + unsigned long *pfn) > +{ > + struct block_device *bdev = bh->b_bdev; > + const struct block_device_operations *ops = bdev->bd_disk->fops; > + void *addr; > + sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); > + return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size); > +} > + > static void dax_new_buf(void *addr, unsigned size, unsigned first, > loff_t offset, loff_t end, int rw) > { > @@ -214,3 +228,196 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, > return retval; > } > EXPORT_SYMBOL_GPL(dax_do_io); > + > +/* > + * The user has performed a load from a hole in the file. Allocating > + * a new page in the file would cause excessive storage usage for > + * workloads with sparse files. We allocate a page cache page instead. > + * We'll kick it out of the page cache if it's ever written to, > + * otherwise it will simply fall out of the page cache under memory > + * pressure without ever having been dirtied. 
> + */ > +static int dax_load_hole(struct address_space *mapping, struct page *page, > + struct vm_fault *vmf) > +{ > + unsigned long size; > + struct inode *inode = mapping->host; > + if (!page) > + page = find_or_create_page(mapping, vmf->pgoff, > + GFP_KERNEL | __GFP_ZERO); > + if (!page) > + return VM_FAULT_OOM; > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (vmf->pgoff >= size) { Maybe comment here that we have to recheck i_size so that we don't create pages in the area truncate_pagecache() has already evicted. > + unlock_page(page); > + page_cache_release(page); > + return VM_FAULT_SIGBUS; > + } > + > + vmf->page = page; > + return VM_FAULT_LOCKED; > +} > + > +static void copy_user_bh(struct page *to, struct inode *inode, > + struct buffer_head *bh, unsigned long vaddr) > +{ > + void *vfrom, *vto; > + dax_get_addr(inode, bh, &vfrom); /* XXX: error handling */ The error handling here is missing as the comment suggests :) > + vto = kmap_atomic(to); > + copy_user_page(vto, vfrom, vaddr, to); > + kunmap_atomic(vto); > +} > + > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ > + struct file *file = vma->vm_file; > + struct inode *inode = file_inode(file); > + struct address_space *mapping = file->f_mapping; > + struct page *page; > + struct buffer_head bh; > + unsigned long vaddr = (unsigned long)vmf->virtual_address; > + sector_t block; > + pgoff_t size; > + unsigned long pfn; > + int error; > + int major = 0; > + > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (vmf->pgoff >= size) > + return VM_FAULT_SIGBUS; > + > + memset(&bh, 0, sizeof(bh)); > + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits); > + bh.b_size = PAGE_SIZE; > + > + repeat: > + page = find_get_page(mapping, vmf->pgoff); > + if (page) { > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { > + page_cache_release(page); > + return VM_FAULT_RETRY; > + } > + if 
(unlikely(page->mapping != mapping)) { > + unlock_page(page); > + page_cache_release(page); > + goto repeat; > + } > + } > + > + error = get_block(inode, block, &bh, 0); > + if (error || bh.b_size < PAGE_SIZE) > + goto sigbus; > + > + if (!buffer_written(&bh) && !vmf->cow_page) { > + if (vmf->flags & FAULT_FLAG_WRITE) { > + error = get_block(inode, block, &bh, 1); > + count_vm_event(PGMAJFAULT); > + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); > + major = VM_FAULT_MAJOR; > + if (error || bh.b_size < PAGE_SIZE) > + goto sigbus; > + } else { > + return dax_load_hole(mapping, page, vmf); > + } > + } > + > + /* Recheck i_size under i_mmap_mutex */ > + mutex_lock(&mapping->i_mmap_mutex); > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (unlikely(vmf->pgoff >= size)) { > + mutex_unlock(&mapping->i_mmap_mutex); > + goto sigbus; > + } > + if (vmf->cow_page) { > + if (buffer_written(&bh)) > + copy_user_bh(vmf->cow_page, inode, &bh, vaddr); > + else > + clear_user_highpage(vmf->cow_page, vaddr); > + if (page) { > + unlock_page(page); > + page_cache_release(page); > + } > + /* do_cow_fault() will release the i_mmap_mutex */ > + return VM_FAULT_COWED; > + } > + > + if (buffer_unwritten(&bh) || buffer_new(&bh)) > + dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); Where is dax_clear_blocks() defined? > + > + error = dax_get_pfn(inode, &bh, &pfn); > + if (error > 0) > + error = vm_insert_mixed(vma, vaddr, pfn); When there's a hole (thus page != NULL) and we are called from dax_mkwrite(), this will always return EBUSY, correct? > + mutex_unlock(&mapping->i_mmap_mutex); > + > + if (page) { > + delete_from_page_cache(page); > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > + PAGE_CACHE_SIZE, 0); Here we unmap the PTE pointing to the hole page but then we'll have to retry the fault again to fill in the pfn we've got? This seems wrong. I'd say we want to remap the PTE from the hole page to a pfn we've got while holding i_mmap_mutex. 
remap_pfn_range() almost does what you need, except that you also need that to work for normal pages. So you might need to create a new helper in mm layer for that. > + unlock_page(page); > + page_cache_release(page); > + } > + > + if (error == -ENOMEM) > + return VM_FAULT_OOM; > + /* -EBUSY is fine, somebody else faulted on the same PTE */ > + if (error != -EBUSY) > + BUG_ON(error); > + return VM_FAULT_NOPAGE | major; > + > + sigbus: > + if (page) { > + unlock_page(page); > + page_cache_release(page); > + } > + return VM_FAULT_SIGBUS; > +} > + > +/** > + * dax_fault - handle a page fault on an XIP file > + * @vma: The virtual memory area where the fault occurred > + * @vmf: The description of the fault > + * @get_block: The filesystem method used to translate file offsets to blocks > + * > + * When a page fault occurs, filesystems may call this helper in their > + * fault handler for XIP files. > + */ > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ > + int result; > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > + > + sb_start_pagefault(sb); You don't need any filesystem freeze protection for the fault handler since that's not going to modify the filesystem. > + file_update_time(vma->vm_file); Why do you update m/ctime? We are only reading the file... > + result = do_dax_fault(vma, vmf, get_block); > + sb_end_pagefault(sb); > + > + return result; > +} > +EXPORT_SYMBOL_GPL(dax_fault); > + > +/** > + * dax_mkwrite - convert a read-only page to read-write in an XIP file > + * @vma: The virtual memory area where the fault occurred > + * @vmf: The description of the fault > + * @get_block: The filesystem method used to translate file offsets to blocks > + * > + * XIP handles reads of holes by adding pages full of zeroes into the > + * mapping. If the page is subsequenty written to, we have to allocate > + * the page on media and free the page that was in the cache. 
> + */ > +int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ > + int result; > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > + > + sb_start_pagefault(sb); > + file_update_time(vma->vm_file); > + result = do_dax_fault(vma, vmf, get_block); > + sb_end_pagefault(sb); > + > + return result; > +} > +EXPORT_SYMBOL_GPL(dax_mkwrite); > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index ef5cf96..e3ce10d 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -25,6 +25,37 @@ > #include "xattr.h" > #include "acl.h" > > +#ifdef CONFIG_EXT2_FS_XIP > +static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return dax_fault(vma, vmf, ext2_get_block); > +} > + > +static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return dax_mkwrite(vma, vmf, ext2_get_block); > +} > + > +static const struct vm_operations_struct ext2_dax_vm_ops = { > + .fault = ext2_dax_fault, > + .page_mkwrite = ext2_dax_mkwrite, > + .remap_pages = generic_file_remap_pages, > +}; > + > +static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + if (!IS_DAX(file_inode(file))) > + return generic_file_mmap(file, vma); > + > + file_accessed(file); > + vma->vm_ops = &ext2_dax_vm_ops; > + vma->vm_flags |= VM_MIXEDMAP; > + return 0; > +} > +#else > +#define ext2_file_mmap generic_file_mmap > +#endif > + > /* > * Called when filp is released. This happens when all file descriptors > * for a single struct file are closed. Note that different open() calls > @@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = { > #ifdef CONFIG_COMPAT > .compat_ioctl = ext2_compat_ioctl, > #endif > - .mmap = generic_file_mmap, > + .mmap = ext2_file_mmap, So what's the point of ext2_file_operations ever handling IS_DAX() inodes? 
Actually ext2_file_operations and ext2_xip_file_operations seem to be the
same after this patch so either you drop ext2_xip_file_operations (I'm for
this) or you can leave generic_file_mmap here and assume ext2_file_mmap is
always called for IS_DAX() inodes.

> 	.open = dquot_file_open,
> 	.release = ext2_release_file,
> 	.fsync = ext2_fsync,
> @@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
> #ifdef CONFIG_COMPAT
> 	.compat_ioctl = ext2_compat_ioctl,
> #endif
> -	.mmap = xip_file_mmap,
> +	.mmap = ext2_file_mmap,
> 	.open = dquot_file_open,
> 	.release = ext2_release_file,
> 	.fsync = ext2_fsync,

								Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page
Date: Wed, 9 Apr 2014 00:17:59 +0200
Message-ID: <20140408221759.GD26019@quack.suse.cz>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Sun 23-03-14 15:08:34, Matthew Wilcox wrote:
> It takes a get_block parameter just like nobh_truncate_page() and
> block_truncate_page()

The patch looks mostly OK. Some minor comments below.
> > Signed-off-by: Matthew Wilcox > --- > fs/dax.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++---- > fs/ext2/inode.c | 2 +- > include/linux/fs.h | 4 ++-- > mm/filemap_xip.c | 40 ---------------------------------------- > 4 files changed, 51 insertions(+), 47 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 863749c..7271be0 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -374,13 +374,13 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > } > > /** > - * dax_fault - handle a page fault on an XIP file > + * dax_fault - handle a page fault on a DAX file > * @vma: The virtual memory area where the fault occurred > * @vmf: The description of the fault > * @get_block: The filesystem method used to translate file offsets to blocks > * > * When a page fault occurs, filesystems may call this helper in their > - * fault handler for XIP files. > + * fault handler for DAX files. > */ > int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > get_block_t get_block) > @@ -398,12 +398,12 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > EXPORT_SYMBOL_GPL(dax_fault); > > /** > - * dax_mkwrite - convert a read-only page to read-write in an XIP file > + * dax_mkwrite - convert a read-only page to read-write in a DAX file > * @vma: The virtual memory area where the fault occurred > * @vmf: The description of the fault > * @get_block: The filesystem method used to translate file offsets to blocks > * > - * XIP handles reads of holes by adding pages full of zeroes into the > + * DAX handles reads of holes by adding pages full of zeroes into the > * mapping. If the page is subsequenty written to, we have to allocate > * the page on media and free the page that was in the cache. > */ Above two hunks belong to the previous patch... 
> @@ -421,3 +421,47 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > return result; > } > EXPORT_SYMBOL_GPL(dax_mkwrite); > + > +/** > + * dax_truncate_page - handle a partial page being truncated in a DAX file > + * @inode: The file being truncated > + * @from: The file offset that is being truncated to > + * @get_block: The filesystem method used to translate file offsets to blocks > + * > + * Similar to block_truncate_page(), this function can be called by a > + * filesystem when it is truncating an DAX file to handle the partial page. > + * > + * We work in terms of PAGE_CACHE_SIZE here for commonality with > + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem > + * took care of disposing of the unnecessary blocks. Even if the filesystem > + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page > + * since the file might be mmaped. Well, DAX mmap support pretty much relies on PAGE_CACHE_SIZE == block size (we cannot really map only a part of a physical page directly...). So the comment seems somewhat misleading. > + */ > +int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) > +{ > + struct buffer_head bh; > + pgoff_t index = from >> PAGE_CACHE_SHIFT; > + unsigned offset = from & (PAGE_CACHE_SIZE-1); > + unsigned length = PAGE_CACHE_ALIGN(from) - from; > + int err; > + Can we WARN_ON_ONCE here if PAGE_CACHE_SHIFT != inode->i_blkbits? Just to catch bugs early. > + /* Block boundary? 
Nothing to do */
> +	if (!length)
> +		return 0;
> +
> +	memset(&bh, 0, sizeof(bh));
> +	bh.b_size = PAGE_CACHE_SIZE;
> +	err = get_block(inode, index, &bh, 0);
> +	if (err < 0)
> +		return err;
> +	if (buffer_written(&bh)) {
> +		void *addr;
> +		err = dax_get_addr(inode, &bh, &addr);
> +		if (err)
> +			return err;
> +		memset(addr + offset, 0, length);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_truncate_page);

								Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O
Date: Tue, 8 Apr 2014 16:21:02 -0400
Message-ID: <20140408202102.GB5727@linux.intel.com>
References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Jan Kara
Return-path:
Content-Disposition: inline
In-Reply-To: <20140408175600.GE2713@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote:
> > +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> > +					loff_t offset, loff_t end, int rw)
> > +{
> > +	loff_t final = end - offset + first; /* The final byte of the buffer */
> > +	if (rw != WRITE) {
> > +		memset(addr, 0, size);
> > +		return;
> > +	}
>   It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do
> this for unwritten blocks) when reading from them. Presumably it could also
> have undesired effects on endurance of persistent memory. Instead I'd expect
> that you simply zero out user provided buffer the same way as you do it for
> holes.

I think we have to zero it here, because the second time we call
get_block() for a given block, it won't be BH_New any more, so we won't
know that it's supposed to be zeroed.

> > +/*
> > + * When ext4 encounters a hole, it likes to return without modifying the
> > + * buffer_head which means that we can't trust b_size. To cope with this,
> > + * we set b_state to 0 before calling get_block and, if any bit is set, we
> > + * know we can trust b_size. Unfortunate, really, since ext4 does know
> > + * precisely how long a hole is and would save us time calling get_block
> > + * repeatedly.
>   Well, this is really a problem of get_blocks() returning the result in
> struct buffer_head which is used for input as well. I don't think it is
> actually ext4 specific.

Of course it's ext4 specific! It's the ext4_get_block() implementation
which is choosing not to return the length of the hole. XFS does return
the length of the hole. I think something like this would fix it:

+++ b/fs/ext4/inode.c
@@ -727,14 +727,14 @@ static int _ext4_get_block(struct inode *inode, sector_t i
 	}

 	ret = ext4_map_blocks(handle, inode, &map, flags);
+	map_bh(bh, inode->i_sb, map.m_pblk);
+	bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+	bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	if (ret > 0) {
 		ext4_io_end_t *io_end = ext4_inode_aio(inode);

-		map_bh(bh, inode->i_sb, map.m_pblk);
-		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
 		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
 			set_buffer_defer_completion(bh);
-		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
 	}
 	if (started)

(completely untested).
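
The b_state check described in the quoted comment can be sketched in
userspace C. This is only a minimal sketch under stated assumptions, not
the patch's code: struct mock_bh, MOCK_BH_MAPPED, MOCK_BLOCK_SIZE and all
mock_* functions are hypothetical stand-ins for the kernel's buffer_head
machinery, and the mock get_block simply treats block 0 as a hole that it
returns without touching the buffer head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head; only the two fields the
 * discussion relies on are modelled. */
struct mock_bh {
	unsigned long b_state;	/* flag bits; 0 means get_block set nothing */
	size_t b_size;		/* in: bytes wanted, out: bytes mapped */
};

#define MOCK_BH_MAPPED	(1UL << 0)	/* illustrative flag, not the kernel value */
#define MOCK_BLOCK_SIZE	4096u		/* assumed filesystem block size */

/* Mock get_block: block 0 is a hole and is returned without modifying bh
 * (the behaviour the thread attributes to ext4); any other block is
 * reported as a mapped extent of two blocks. */
static int mock_get_block(unsigned long block, struct mock_bh *bh)
{
	if (block == 0)
		return 0;
	bh->b_state |= MOCK_BH_MAPPED;
	bh->b_size = 2 * MOCK_BLOCK_SIZE;
	return 0;
}

/* b_size is only trusted if get_block set at least one state bit. */
static bool mock_size_valid(const struct mock_bh *bh)
{
	return bh->b_state != 0;
}

/* Returns how many bytes the caller may treat as a single extent. */
static size_t mock_extent_len(unsigned long block, size_t wanted)
{
	/* Clear b_state before the call so stale bits cannot be mistaken
	 * for a result. */
	struct mock_bh bh = { .b_state = 0, .b_size = wanted };

	if (mock_get_block(block, &bh))
		return 0;
	if (!mock_size_valid(&bh))
		bh.b_size = MOCK_BLOCK_SIZE;	/* fall back to one block */
	return bh.b_size;
}
```

With the hole at block 0 the caller advances one block at a time, which is
the repeated get_block cost the comment complains about; a filesystem that
reported the hole length would avoid it.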

> > +	while (offset < end) {
> > +		void __user *buf = iov[seg].iov_base + copied;
> > +
> > +		if (offset == max) {
> > +			sector_t block = offset >> inode->i_blkbits;
> > +			unsigned first = offset - (block << inode->i_blkbits);
> > +			long size;
> > +
> > +			if (offset == bh_max) {
> > +				bh->b_size = PAGE_ALIGN(end - offset);
> > +				bh->b_state = 0;
> > +				retval = get_block(inode, block, bh,
> > +								rw == WRITE);
> > +				if (retval)
> > +					break;
> > +				if (!buffer_size_valid(bh))
> > +					bh->b_size = 1 << inode->i_blkbits;
> > +				bh_max = offset - first + bh->b_size;
> > +			} else {
> > +				unsigned done = bh->b_size - (bh_max -
> > +							(offset - first));
> > +				bh->b_blocknr += done >> inode->i_blkbits;
> > +				bh->b_size -= done;
>   It took me quite some time to figure out what this does and whether it is
> correct :). Why isn't this at the place where we advance all other
> iterators like offset, addr, etc.?

It'll be kind of tricky to move it because 'len' is not necessarily
a multiple of i_blkbits, so we can't necessarily maintain b_blocknr
accurately.

> > +			if (rw == WRITE) {
> > +				if (!buffer_mapped(bh)) {
> > +					retval = -EIO;
> > +					break;
>   -EIO looks like a wrong error here. Or maybe it is the right one and it
> only needs some explanation? The thing is that for direct IO some
> filesystems choose not to fill holes for direct IO and fall back to
> buffered IO instead (to avoid exposure of uninitialized blocks if the
> system crashes after blocks have been added to a file but before they were
> written out). For DAX you are pretty much free to define what you ask from
> the get_blocks() (and this fallback behavior is somewhat disputed behavior
> in direct IO case so you might want to differ here) but you should document
> it somewhere.

Hmm ... I thought that calling get_block() with the create argument would
force the return of a bh with the Mapped bit set. Did I misunderstand that
aspect of the undocumented get_block() API too?
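
The two behaviours under discussion (a get_block that always maps on
create, and one that declines to fill holes) can be contrasted in a small
userspace sketch. Everything here is hypothetical and only mirrors the
quoted "-EIO if not mapped" check: struct mock_bh, mock_get_block_t,
fs_fills_holes(), fs_refuses_holes() and mock_dax_write_block() are
illustrative names, and block 0 is assumed to be the only hole.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the Mapped bit of a buffer_head. */
struct mock_bh {
	bool mapped;
};

typedef int (*mock_get_block_t)(unsigned long block, struct mock_bh *bh,
				int create);

/* Filesystem that allocates a block for a hole when create is set. */
static int fs_fills_holes(unsigned long block, struct mock_bh *bh, int create)
{
	bh->mapped = create || block != 0;
	return 0;
}

/* Filesystem that leaves holes unmapped even when create is set, as the
 * quoted review says some do for direct IO. */
static int fs_refuses_holes(unsigned long block, struct mock_bh *bh, int create)
{
	(void)create;
	bh->mapped = block != 0;
	return 0;
}

/* Write-path check modelled on the quoted hunk: ask for allocation, and
 * report -EIO if the filesystem still hands back an unmapped buffer. */
static int mock_dax_write_block(mock_get_block_t get_block, unsigned long block)
{
	struct mock_bh bh = { .mapped = false };
	int ret = get_block(block, &bh, 1);

	if (ret)
		return ret;
	if (!bh.mapped)
		return -EIO;
	return 0;
}
```

Under these assumptions a write into the hole succeeds with the first
callback and fails with -EIO with the second, which is the case the review
asks to have documented.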

> > +	if ((flags & DIO_LOCKING) && (rw == READ)) {
> > +		struct address_space *mapping = inode->i_mapping;
> > +		mutex_lock(&inode->i_mutex);
> > +		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
> > +		if (retval) {
> > +			mutex_unlock(&inode->i_mutex);
> > +			goto out;
> > +		}
>   Is there a reason for this? I'd assume DAX has no pages in pagecache...

There will be pages in the page cache for holes that we page faulted on.
They must go! :-)

> > @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> >  	struct inode *inode = mapping->host;
> >  	ssize_t ret;
> >
> > -	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
> > +	if (IS_DAX(inode))
> > +		ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
> > +				ext2_get_block, NULL, DIO_LOCKING);
> > +	else
> > +		ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
> >  				 ext2_get_block);
>   I'd somewhat prefer to have a ext2_direct_IO() as is and have
> ext2_dax_IO() call only dax_do_io() (and use that as .direct_io in
> ext2_aops_xip). Then there's no need to check IS_DAX() and the code would
> look more obvious to me. But I don't feel strongly about it.

I can look at that ... but I was hoping to not have separate aops for
XIP and non-XIP files.

> > @@ -2681,6 +2686,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
> >  extern void save_mount_options(struct super_block *sb, char *options);
> >  extern void replace_mount_options(struct super_block *sb, char *options);
> >
> > +static inline bool io_is_direct(struct file *filp)
> > +{
> > +	return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
> > +}
> > +
>   BTW: It seems fs/open.c: open_check_o_direct() can be simplified to not
> check for get_xip_mem(), cannot it?

That's in a later patch

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O
Date: Wed, 9 Apr 2014 11:14:50 +0200
Message-ID: <20140409091450.GA32103@quack.suse.cz>
References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> <20140408202102.GB5727@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Kara, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <20140408202102.GB5727@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Tue 08-04-14 16:21:02, Matthew Wilcox wrote:
> On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote:
> > > +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> > > +					loff_t offset, loff_t end, int rw)
> > > +{
> > > +	loff_t final = end - offset + first; /* The final byte of the buffer */
> > > +	if (rw != WRITE) {
> > > +		memset(addr, 0, size);
> > > +		return;
> > > +	}
> >   It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do
> > this for unwritten blocks) when reading from them. Presumably it could also
> > have undesired effects on endurance of persistent memory. Instead I'd expect
> > that you simply zero out user provided buffer the same way as you do it for
> > holes.
>
> I think we have to zero it here, because the second time we call
> get_block() for a given block, it won't be BH_New any more, so we won't
> know that it's supposed to be zeroed.
  But how can you have BH_New buffer when you didn't ask get_blocks() to
create any block? That would be a bug in the get_blocks() implementation...
Or am I missing something?

> > > +/*
> > > + * When ext4 encounters a hole, it likes to return without modifying the
> > > + * buffer_head which means that we can't trust b_size. To cope with this,
> > > + * we set b_state to 0 before calling get_block and, if any bit is set, we
> > > + * know we can trust b_size. Unfortunate, really, since ext4 does know
> > > + * precisely how long a hole is and would save us time calling get_block
> > > + * repeatedly.
> >   Well, this is really a problem of get_blocks() returning the result in
> > struct buffer_head which is used for input as well. I don't think it is
> > actually ext4 specific.
>
> Of course it's ext4 specific! It's the ext4_get_block() implementation
> which is choosing not to return the length of the hole. XFS does return
> the length of the hole. I think something like this would fix it:
  OK, but there are filesystems which do the same thing as ext4 (e.g.
btrfs) and historically no one really cared. E.g. direct IO code advances
only by a single block regardless of what the filesystem returns when the
buffer is unmapped. As you correctly mention, the get_blocks() API isn't
really documented so no one has really defined what should happen when you
ask the filesystem to map some blocks and there's a hole. I agree what XFS
does looks sensible and ext4 can do the same. Hopefully this gets cleaned
up when Dave finishes his new block mapping interface.

> +++ b/fs/ext4/inode.c
> @@ -727,14 +727,14 @@ static int _ext4_get_block(struct inode *inode, sector_t i
>  	}
>
>  	ret = ext4_map_blocks(handle, inode, &map, flags);
> +	map_bh(bh, inode->i_sb, map.m_pblk);
> +	bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
> +	bh->b_size = inode->i_sb->s_blocksize * map.m_len;
>  	if (ret > 0) {
>  		ext4_io_end_t *io_end = ext4_inode_aio(inode);
>
> -		map_bh(bh, inode->i_sb, map.m_pblk);
> -		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
>  		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
>  			set_buffer_defer_completion(bh);
> -		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
>  		ret = 0;
>  	}
>  	if (started)
  This wouldn't quite work because even ext4_map_blocks() doesn't bother to
fill in 'map' when it finds a hole. But it won't be complicated to
propagate the information.

> > > +	while (offset < end) {
> > > +		void __user *buf = iov[seg].iov_base + copied;
> > > +
> > > +		if (offset == max) {
> > > +			sector_t block = offset >> inode->i_blkbits;
> > > +			unsigned first = offset - (block << inode->i_blkbits);
> > > +			long size;
> > > +
> > > +			if (offset == bh_max) {
> > > +				bh->b_size = PAGE_ALIGN(end - offset);
> > > +				bh->b_state = 0;
> > > +				retval = get_block(inode, block, bh,
> > > +								rw == WRITE);
> > > +				if (retval)
> > > +					break;
> > > +				if (!buffer_size_valid(bh))
> > > +					bh->b_size = 1 << inode->i_blkbits;
> > > +				bh_max = offset - first + bh->b_size;
> > > +			} else {
> > > +				unsigned done = bh->b_size - (bh_max -
> > > +							(offset - first));
> > > +				bh->b_blocknr += done >> inode->i_blkbits;
> > > +				bh->b_size -= done;
> >   It took me quite some time to figure out what this does and whether it is
> > correct :). Why isn't this at the place where we advance all other
> > iterators like offset, addr, etc.?
>
> It'll be kind of tricky to move it because 'len' is not necessarily
> a multiple of i_blkbits, so we can't necessarily maintain b_blocknr
> accurately.
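
The difficulty quoted above can be made concrete with a little userspace
arithmetic: a copy length that is not a whole number of blocks can still
be accounted for by counting how many block boundaries the copy crosses,
which is the calculation proposed in the reply that follows. This is a
minimal sketch only; struct mock_extent, blocks_crossed() and
mock_advance() are hypothetical names, and blkbits is assumed to be log2
of the block size.

```c
#include <stdint.h>

/* Hypothetical stand-in for the b_blocknr/b_size pair of a buffer_head. */
struct mock_extent {
	uint64_t blocknr;	/* first block still covered by the mapping */
	uint64_t size;		/* bytes still covered, from that block's start */
};

/* Number of block boundaries crossed by a copy of 'len' bytes starting at
 * byte position 'pos'. */
static uint64_t blocks_crossed(uint64_t pos, uint64_t len, unsigned blkbits)
{
	return ((pos + len) >> blkbits) - (pos >> blkbits);
}

/* Advance the position and the extent together, so the extent always
 * starts at the block containing the new position. */
static void mock_advance(struct mock_extent *ext, uint64_t *pos,
			 uint64_t len, unsigned blkbits)
{
	uint64_t blocks = blocks_crossed(*pos, len, blkbits);

	ext->blocknr += blocks;
	ext->size -= blocks << blkbits;
	*pos += len;
}
```

For example, with 4096-byte blocks a 200-byte copy starting at byte 4000
crosses one boundary, so the extent moves forward by one block even though
the length is far from block-sized.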
Yeah, after I understood the code I also understood why you do it the way
you did. But we could do something like:
	...
+		if (!len)
+			break;
+
		blocks = ((offset + len) >> inode->i_blkbits) - (offset >> inode->i_blkbits);
		bh->b_blocknr += blocks;
		bh->b_size -= blocks << inode->i_blkbits;
+		offset += len;
+		copied += len;
+		addr += len;
	...

BTW: it might be good to store inode->i_blkbits in a local variable. It
makes some expressions shorter.

BTW2: although direct IO uses 'offset' for the position in the file, the
rest of the VFS uses 'pos' for that, and that seems to be a less overloaded
term, so for me it would be easier if you used 'pos' instead of 'offset'.
Just a suggestion.

> > > +			if (rw == WRITE) {
> > > +				if (!buffer_mapped(bh)) {
> > > +					retval = -EIO;
> > > +					break;
> > -EIO looks like a wrong error here. Or maybe it is the right one and it
> > only needs some explanation? The thing is that for direct IO some
> > filesystems choose not to fill holes for direct IO and fall back to
> > buffered IO instead (to avoid exposure of uninitialized blocks if the
> > system crashes after blocks have been added to a file but before they were
> > written out). For DAX you are pretty much free to define what you ask from
> > the get_blocks() (and this fallback behavior is somewhat disputed behavior
> > in direct IO case so you might want to differ here) but you should document
> > it somewhere.
>
> Hmm ... I thought that calling get_block() with the create argument would
> force the return of a bh with the Mapped bit set. Did I misunderstand that
> aspect of the undocumented get_block() API too?

As you mention, the API is undocumented and not really designed. So
filesystems do whatever causes the generic code to do what they want (it's
a mess, I know). In this case, I'm warning you that there are filesystems
which refuse to fill in holes from the get_blocks() function passed to
blockdev_direct_IO() (even ext4 does this for inodes with the old
indirect-block based on-disk format).
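A small userspace check of the block-count arithmetic suggested earlier in this reply (the block index of the end position minus the block index of the start position). This is a sketch only: blocks_crossed() and advance_blocks() are hypothetical helper names, and the 4096-byte block size mentioned in the note is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of block boundaries crossed when the file position moves from
 * pos to pos + len.  len need not be a multiple of the block size.
 */
static uint64_t blocks_crossed(uint64_t pos, uint64_t len, unsigned int blkbits)
{
	return ((pos + len) >> blkbits) - (pos >> blkbits);
}

/*
 * Apply a sequence of copies starting at pos and return how far a block
 * number tracked alongside the position would have advanced in total.
 */
static uint64_t advance_blocks(uint64_t pos, const uint64_t *lens, int n,
			       unsigned int blkbits)
{
	uint64_t blocks = 0;
	int i;

	for (i = 0; i < n; i++) {
		blocks += blocks_crossed(pos, lens[i], blkbits);
		pos += lens[i];
	}
	return blocks;
}
```

The per-copy counts telescope: their sum equals (end >> blkbits) - (start >> blkbits), so a block number advanced this way stays exact even when individual copy lengths are not multiples of the block size (for example 100, 4000, 5000 and 3188 bytes with 4096-byte blocks).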
You can just define that DAX fails horribly in these cases and I'm fine
with that, at least at this stage. If someone bothers later, a fallback to
buffered IO can be implemented. But we should document this somewhere.

> > > +	if ((flags & DIO_LOCKING) && (rw == READ)) {
> > > +		struct address_space *mapping = inode->i_mapping;
> > > +		mutex_lock(&inode->i_mutex);
> > > +		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
> > > +		if (retval) {
> > > +			mutex_unlock(&inode->i_mutex);
> > > +			goto out;
> > > +		}
> > Is there a reason for this? I'd assume DAX has no pages in pagecache...
>
> There will be pages in the page cache for holes that we page faulted on.
> They must go! :-)

Well, but this will only write back dirty pages, and if I read the code
correctly those pages will never be dirty since dax_mkwrite() will replace
them. Or am I missing something?

> > > @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> > >  	struct inode *inode = mapping->host;
> > >  	ssize_t ret;
> > >
> > > -	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
> > > +	if (IS_DAX(inode))
> > > +		ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
> > > +				ext2_get_block, NULL, DIO_LOCKING);
> > > +	else
> > > +		ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
> > >  				 ext2_get_block);
> > I'd somewhat prefer to have a ext2_direct_IO() as is and have
> > ext2_dax_IO() call only dax_do_io() (and use that as .direct_io in
> > ext2_aops_xip). Then there's no need to check IS_DAX() and the code would
> > look more obvious to me. But I don't feel strongly about it.
>
> I can look at that ... but I was hoping to not have separate aops for
> XIP and non-XIP files.

OK, if you can do that, then I'm fine with the code as is.

								Honza
-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page
Date: Wed, 9 Apr 2014 11:26:35 +0200
Message-ID: <20140409092635.GB32103@quack.suse.cz>
References: <20140408221759.GD26019@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <20140408221759.GD26019@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed 09-04-14 00:17:59, Jan Kara wrote:
> On Sun 23-03-14 15:08:34, Matthew Wilcox wrote:
> > +/**
> > + * dax_truncate_page - handle a partial page being truncated in a DAX file
> > + * @inode: The file being truncated
> > + * @from: The file offset that is being truncated to
> > + * @get_block: The filesystem method used to translate file offsets to blocks
> > + *
> > + * Similar to block_truncate_page(), this function can be called by a
> > + * filesystem when it is truncating an DAX file to handle the partial page.
> > + *
> > + * We work in terms of PAGE_CACHE_SIZE here for commonality with
> > + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
> > + * took care of disposing of the unnecessary blocks. Even if the filesystem
> > + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
> > + * since the file might be mmaped.
> Well, DAX mmap support pretty much relies on PAGE_CACHE_SIZE == block
> size (we cannot really map only a part of a physical page directly...). So
> the comment seems somewhat misleading.

I thought about this for a while and classical IO, truncation etc. could
easily work for blocksize < pagesize. And for mmap() you could just use
pagecache. Not sure if it's worth the complications though.
Anyway, we should decide whether we don't care about blocksize <
PAGE_CACHE_SIZE at all, or whether we try to make functional those things
which can reasonably easily be made to work. In the latter case
dax_truncate_page() needs some tweaking because it currently assumes
blocksize == PAGE_CACHE_SIZE.

								Honza
-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks
Date: Wed, 9 Apr 2014 11:46:44 +0200
Message-ID: <20140409094644.GD32103@quack.suse.cz>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Sun 23-03-14 15:08:37, Matthew Wilcox wrote:
> This is practically generic code; other filesystems will want to call
> it from other places, but there's nothing ext2-specific about it.
>
> Make it a little more generic by allowing it to take a count of the number
> of bytes to zero rather than fixing it to a single page. Thanks to Dave
> Hansen for suggesting that I need to call cond_resched() if zeroing more
> than one page.

Another day, some more review ;) Comments below.
>
> Signed-off-by: Matthew Wilcox
> ---
>  fs/dax.c           | 34 ++++++++++++++++++++++++++++++++++
>  fs/ext2/inode.c    |  8 +++++---
>  fs/ext2/xip.c      | 23 -----------------------
>  fs/ext2/xip.h      |  3 ---
>  include/linux/fs.h |  6 ++++++
>  5 files changed, 45 insertions(+), 29 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 7271be0..45a0a41 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -23,9 +23,43 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>
> +int dax_clear_blocks(struct inode *inode, sector_t block, long size)
> +{
> +	struct block_device *bdev = inode->i_sb->s_bdev;
> +	const struct block_device_operations *ops = bdev->bd_disk->fops;
> +	sector_t sector = block << (inode->i_blkbits - 9);
> +	unsigned long pfn;
> +
> +	might_sleep();
> +	do {
> +		void *addr;
> +		long count = ops->direct_access(bdev, sector, &addr, &pfn,
> +									size);

So do you assume blocksize == PAGE_SIZE here? If not, addr could be in
the middle of the page AFAICT.

> +		if (count < 0)
> +			return count;
> +		while (count >= PAGE_SIZE) {
> +			clear_page(addr);
> +			addr += PAGE_SIZE;
> +			size -= PAGE_SIZE;
> +			count -= PAGE_SIZE;
> +			sector += PAGE_SIZE / 512;
> +			cond_resched();
> +		}
> +		if (count > 0) {
> +			memset(addr, 0, count);
> +			sector += count / 512;
> +			size -= count;
> +		}
> +	} while (size);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_clear_blocks);
> +
>  static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
>  								void **addr)
>  {
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index b156fe8..a9346a9 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
>
>  		if (IS_DAX(inode)) {
>  			/*
> -			 * we need to clear the block
> +			 * block must be initialised before we put it in the tree
> +			 * so that it's not found by another thread before it's
> +			 * initialised
>  			 */
> -			err = ext2_clear_xip_target (inode,
> -				le32_to_cpu(chain[depth-1].key));
> +			err = dax_clear_blocks(inode,
le32_to_cpu(chain[depth-1].key),
> +						count << inode->i_blkbits);

Umm 'count' looks wrong here. You want to clear only one block, don't
you?

>  			if (err) {
>  				mutex_unlock(&ei->truncate_mutex);
>  				goto cleanup;

-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb()
Date: Wed, 9 Apr 2014 11:52:54 +0200
Message-ID: <20140409095254.GE32103@quack.suse.cz>
References: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Sun 23-03-14 15:08:38, Matthew Wilcox wrote:
> Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
> doesn't make sense, since changing the XIP option on remount isn't
> allowed. It also doesn't make sense to re-check whether blocksize is
> supported since it can't change between mounts.
>
> Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
> equivalent check and delete the definition.

Looks good. You can add:
Reviewed-by: Jan Kara

One nit below:
...
> @@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
>  	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
>  		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);
>
> -	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
> -				    EXT2_MOUNT_XIP if not */
> -
> -	if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
> -		ext2_msg(sb, KERN_WARNING,
> -			"warning: unsupported blocksize for xip");
> -		err = -EINVAL;
> -		goto restore_opts;
> -	}
> -
>  	es = sbi->s_es;
> -	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
> +	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
>  		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
>  			 "xip flag with busy inodes while remounting");
> -		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
> -		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
> +		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;

Although this is correct, it was easier to see that the previous code is
correct so I'd prefer if you kept it that way.

								Honza
-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 13/22] ext2: Remove ext2_use_xip
Date: Wed, 9 Apr 2014 11:55:49 +0200
Message-ID: <20140409095549.GF32103@quack.suse.cz>
References: <0c65dcd599646e3054d0c524a0c5b25b07885763.1395591795.git.matthew.r.wilcox@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <0c65dcd599646e3054d0c524a0c5b25b07885763.1395591795.git.matthew.r.wilcox@intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Sun 23-03-14 15:08:39, Matthew Wilcox wrote:
> Replace ext2_use_xip() with test_opt(XIP) which expands to the same code

Looks good.
You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > fs/ext2/ext2.h | 4 ++++ > fs/ext2/inode.c | 2 +- > fs/ext2/namei.c | 4 ++-- > 3 files changed, 7 insertions(+), 3 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index d9a17d0..5ecf570 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -380,7 +380,11 @@ struct ext2_inode { > #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ > #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ > #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ > +#ifdef CONFIG_FS_XIP > #define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ > +#else > +#define EXT2_MOUNT_XIP 0 > +#endif > #define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */ > #define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */ > #define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */ > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index a9346a9..2e587e2 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) > > if (S_ISREG(inode->i_mode)) { > inode->i_op = &ext2_file_inode_operations; > - if (ext2_use_xip(inode->i_sb)) { > + if (test_opt(inode->i_sb, XIP)) { > inode->i_mapping->a_ops = &ext2_aops_xip; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index c268d0a..846c356 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (ext2_use_xip(inode->i_sb)) { > + if (test_opt(inode->i_sb, XIP)) { > inode->i_mapping->a_ops = &ext2_aops_xip; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > @@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, 
umode_t mode)
>  		return PTR_ERR(inode);
>
>  	inode->i_op = &ext2_file_inode_operations;
> -	if (ext2_use_xip(inode->i_sb)) {
> +	if (test_opt(inode->i_sb, XIP)) {
>  		inode->i_mapping->a_ops = &ext2_aops_xip;
>  		inode->i_fop = &ext2_xip_file_operations;
>  	} else if (test_opt(inode->i_sb, NOBH)) {
> --
> 1.9.0

-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
Date: Wed, 9 Apr 2014 11:59:18 +0200
Message-ID: <20140409095918.GG32103@quack.suse.cz>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Sun 23-03-14 15:08:41, Matthew Wilcox wrote:
> The fewer Kconfig options we have the better. Use the generic
> CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.
>
> Signed-off-by: Matthew Wilcox

Looks good. You can add:
Reviewed-by: Jan Kara

BTW: Is it really only 2KB of code?
Honza > --- > fs/Kconfig | 21 ++++++++++++++------- > fs/Makefile | 2 +- > fs/ext2/Kconfig | 11 ----------- > fs/ext2/ext2.h | 2 +- > fs/ext2/file.c | 4 ++-- > fs/ext2/super.c | 4 ++-- > include/linux/fs.h | 4 ++-- > 7 files changed, 22 insertions(+), 26 deletions(-) > > diff --git a/fs/Kconfig b/fs/Kconfig > index 7385e54..620ab73 100644 > --- a/fs/Kconfig > +++ b/fs/Kconfig > @@ -13,13 +13,6 @@ if BLOCK > source "fs/ext2/Kconfig" > source "fs/ext3/Kconfig" > source "fs/ext4/Kconfig" > - > -config FS_XIP > -# execute in place > - bool > - depends on EXT2_FS_XIP > - default y > - > source "fs/jbd/Kconfig" > source "fs/jbd2/Kconfig" > > @@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig" > source "fs/btrfs/Kconfig" > source "fs/nilfs2/Kconfig" > > +config FS_DAX > + bool "Direct Access support" > + depends on MMU > + help > + Direct Access (DAX) can be used on memory-backed block devices. > + If the block device supports DAX and the filesystem supports DAX, > + then you can avoid using the pagecache to buffer I/Os. Turning > + on this option will compile in support for DAX; you will need to > + mount the filesystem using the -o xip option. > + > + If you do not have a block device that is capable of using this, > + or if unsure, say N. Saying Y will increase the size of the kernel > + by about 2kB. 
> + > endif # BLOCK > > # Posix ACL utility routines > diff --git a/fs/Makefile b/fs/Makefile > index 2f194cd..b7e0a13 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o > obj-$(CONFIG_TIMERFD) += timerfd.o > obj-$(CONFIG_EVENTFD) += eventfd.o > obj-$(CONFIG_AIO) += aio.o > -obj-$(CONFIG_FS_XIP) += dax.o > +obj-$(CONFIG_FS_DAX) += dax.o > obj-$(CONFIG_FILE_LOCKING) += locks.o > obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o > obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o > diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig > index 14a6780..c634874e 100644 > --- a/fs/ext2/Kconfig > +++ b/fs/ext2/Kconfig > @@ -42,14 +42,3 @@ config EXT2_FS_SECURITY > > If you are not using a security module that requires using > extended attributes for file security labels, say N. > - > -config EXT2_FS_XIP > - bool "Ext2 execute in place support" > - depends on EXT2_FS && MMU > - help > - Execute in place can be used on memory-backed block devices. If you > - enable this option, you can select to mount block devices which are > - capable of this feature without using the page cache. > - > - If you do not use a block device that is capable of using this, > - or if unsure, say N. 
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index 5ecf570..b30c3bd 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -380,7 +380,7 @@ struct ext2_inode { > #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ > #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ > #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ > -#ifdef CONFIG_FS_XIP > +#ifdef CONFIG_FS_DAX > #define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ > #else > #define EXT2_MOUNT_XIP 0 > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index e3ce10d..ae7f000 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -25,7 +25,7 @@ > #include "xattr.h" > #include "acl.h" > > -#ifdef CONFIG_EXT2_FS_XIP > +#ifdef CONFIG_FS_DAX > static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > { > return dax_fault(vma, vmf, ext2_get_block); > @@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = { > .splice_write = generic_file_splice_write, > }; > > -#ifdef CONFIG_EXT2_FS_XIP > +#ifdef CONFIG_FS_DAX > const struct file_operations ext2_xip_file_operations = { > .llseek = generic_file_llseek, > .read = do_sync_read, > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index 752ccb4..fdcacf7 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) > seq_puts(seq, ",grpquota"); > #endif > > -#if defined(CONFIG_EXT2_FS_XIP) > +#ifdef CONFIG_FS_DAX > if (sbi->s_mount_opt & EXT2_MOUNT_XIP) > seq_puts(seq, ",xip"); > #endif > @@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb) > break; > #endif > case Opt_xip: > -#ifdef CONFIG_EXT2_FS_XIP > +#ifdef CONFIG_FS_DAX > set_opt (sbi->s_mount_opt, XIP); > #else > ext2_msg(sb, KERN_INFO, "xip option not supported"); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index aeab3fda..bff394d 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ 
-1681,7 +1681,7 @@ struct super_operations { > #define IS_IMA(inode) ((inode)->i_flags & S_IMA) > #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) > #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) > -#ifdef CONFIG_FS_XIP > +#ifdef CONFIG_FS_DAX > #define IS_DAX(inode) ((inode)->i_flags & S_DAX) > #else > #define IS_DAX(inode) 0 > @@ -2519,7 +2519,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset, > extern int generic_file_open(struct inode * inode, struct file * filp); > extern int nonseekable_open(struct inode * inode, struct file * filp); > > -#ifdef CONFIG_FS_XIP > +#ifdef CONFIG_FS_DAX > int dax_clear_blocks(struct inode *, sector_t block, long size); > int dax_truncate_page(struct inode *, loff_t from, get_block_t); > ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 14/22] ext2: Remove xip.c and xip.h Date: Wed, 9 Apr 2014 11:59:41 +0200 Message-ID: <20140409095941.GH32103@quack.suse.cz> References: <33ff0862f6d99b352429ef4494817544c3d5da68.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <33ff0862f6d99b352429ef4494817544c3d5da68.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:40, Matthew Wilcox wrote: > These files are now empty, so delete them Looks good, you can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > fs/ext2/Makefile | 1 - > fs/ext2/inode.c | 1 - > fs/ext2/namei.c | 1 - > fs/ext2/super.c | 1 - > fs/ext2/xip.c | 15 --------------- > fs/ext2/xip.h | 16 ---------------- > 6 files changed, 35 deletions(-) > delete mode 100644 fs/ext2/xip.c > delete mode 100644 fs/ext2/xip.h > > diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile > index f42af45..445b0e9 100644 > --- a/fs/ext2/Makefile > +++ b/fs/ext2/Makefile > @@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \ > ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o > ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o > ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o > -ext2-$(CONFIG_EXT2_FS_XIP) += xip.o > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 2e587e2..67124f0 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -34,7 +34,6 @@ > #include > #include "ext2.h" > #include "acl.h" > -#include "xip.h" > #include "xattr.h" > > static int __ext2_write_inode(struct inode *inode, int do_sync); > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index 846c356..7ca803f 100644 > --- 
a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -35,7 +35,6 @@ > #include "ext2.h" > #include "xattr.h" > #include "acl.h" > -#include "xip.h" > > static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode) > { > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index 3a1db39..752ccb4 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -35,7 +35,6 @@ > #include "ext2.h" > #include "xattr.h" > #include "acl.h" > -#include "xip.h" > > static void ext2_sync_super(struct super_block *sb, > struct ext2_super_block *es, int wait); > diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c > deleted file mode 100644 > index 66ca113..0000000 > --- a/fs/ext2/xip.c > +++ /dev/null > @@ -1,15 +0,0 @@ > -/* > - * linux/fs/ext2/xip.c > - * > - * Copyright (C) 2005 IBM Corporation > - * Author: Carsten Otte (cotte@de.ibm.com) > - */ > - > -#include > -#include > -#include > -#include > -#include > -#include "ext2.h" > -#include "xip.h" > - > diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h > deleted file mode 100644 > index 87eeb04..0000000 > --- a/fs/ext2/xip.h > +++ /dev/null > @@ -1,16 +0,0 @@ > -/* > - * linux/fs/ext2/xip.h > - * > - * Copyright (C) 2005 IBM Corporation > - * Author: Carsten Otte (cotte@de.ibm.com) > - */ > - > -#ifdef CONFIG_EXT2_FS_XIP > -static inline int ext2_use_xip (struct super_block *sb) > -{ > - struct ext2_sb_info *sbi = EXT2_SB(sb); > - return (sbi->s_mount_opt & EXT2_MOUNT_XIP); > -} > -#else > -#define ext2_use_xip(sb) 0 > -#endif > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 16/22] ext2: Remove ext2_aops_xip Date: Wed, 9 Apr 2014 12:02:33 +0200 Message-ID: <20140409100233.GI32103@quack.suse.cz> References: <0b6512aa46a504459f41d3c609fc20c93d4a911a.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <0b6512aa46a504459f41d3c609fc20c93d4a911a.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:42, Matthew Wilcox wrote: > We shouldn't need a special address_space_operations any more > > Signed-off-by: Matthew Wilcox Looks good. You can add: Reviewed-by: Jan Kara Honza > --- > fs/ext2/ext2.h | 1 - > fs/ext2/inode.c | 7 +------ > fs/ext2/namei.c | 4 ++-- > 3 files changed, 3 insertions(+), 9 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index b30c3bd..b8b1c11 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations; > > /* inode.c */ > extern const struct address_space_operations ext2_aops; > -extern const struct address_space_operations ext2_aops_xip; > extern const struct address_space_operations ext2_nobh_aops; > > /* namei.c */ > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 67124f0..7ca76da 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = { > .error_remove_page = generic_error_remove_page, > }; > > -const struct address_space_operations ext2_aops_xip = { > - .bmap = ext2_bmap, > - .direct_IO = ext2_direct_IO, > -}; > - > const struct address_space_operations ext2_nobh_aops = { > .readpage = ext2_readpage, > .readpages = ext2_readpages, > @@ 
-1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) > if (S_ISREG(inode->i_mode)) { > inode->i_op = &ext2_file_inode_operations; > if (test_opt(inode->i_sb, XIP)) { > - inode->i_mapping->a_ops = &ext2_aops_xip; > + inode->i_mapping->a_ops = &ext2_aops; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index 7ca803f..0db888c 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode > > inode->i_op = &ext2_file_inode_operations; > if (test_opt(inode->i_sb, XIP)) { > - inode->i_mapping->a_ops = &ext2_aops_xip; > + inode->i_mapping->a_ops = &ext2_aops; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > @@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) > > inode->i_op = &ext2_file_inode_operations; > if (test_opt(inode->i_sb, XIP)) { > - inode->i_mapping->a_ops = &ext2_aops_xip; > + inode->i_mapping->a_ops = &ext2_aops; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 17/22] Get rid of most mentions of XIP in ext2 Date: Wed, 9 Apr 2014 12:04:35 +0200 Message-ID: <20140409100435.GJ32103@quack.suse.cz> References: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:43, Matthew Wilcox wrote: > The only remaining usage is userspace's 'xip' option. Looks good. You can add: Reviewed-by: Jan Kara Honza > --- > fs/ext2/ext2.h | 6 +++--- > fs/ext2/file.c | 2 +- > fs/ext2/inode.c | 6 +++--- > fs/ext2/namei.c | 8 ++++---- > fs/ext2/super.c | 16 ++++++++-------- > 5 files changed, 19 insertions(+), 19 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index b8b1c11..0e1fe9d 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -381,9 +381,9 @@ struct ext2_inode { > #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ > #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ > #ifdef CONFIG_FS_DAX > -#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ > +#define EXT2_MOUNT_DAX 0x010000 /* Direct Access */ > #else > -#define EXT2_MOUNT_XIP 0 > +#define EXT2_MOUNT_DAX 0 > #endif > #define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */ > #define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */ > @@ -789,7 +789,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, > int datasync); > extern const struct inode_operations ext2_file_inode_operations; > extern const struct file_operations ext2_file_operations; > -extern const struct 
file_operations ext2_xip_file_operations; > +extern const struct file_operations ext2_dax_file_operations; > > /* inode.c */ > extern const struct address_space_operations ext2_aops; > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index ae7f000..f9bcb9b 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = { > }; > > #ifdef CONFIG_FS_DAX > -const struct file_operations ext2_xip_file_operations = { > +const struct file_operations ext2_dax_file_operations = { > .llseek = generic_file_llseek, > .read = do_sync_read, > .write = do_sync_write, > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 7ca76da..3776063 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -1285,7 +1285,7 @@ void ext2_set_inode_flags(struct inode *inode) > inode->i_flags |= S_NOATIME; > if (flags & EXT2_DIRSYNC_FL) > inode->i_flags |= S_DIRSYNC; > - if (test_opt(inode->i_sb, XIP)) > + if (test_opt(inode->i_sb, DAX)) > inode->i_flags |= S_DAX; > } > > @@ -1387,9 +1387,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) > > if (S_ISREG(inode->i_mode)) { > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, XIP)) { > + if (test_opt(inode->i_sb, DAX)) { > inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_xip_file_operations; > + inode->i_fop = &ext2_dax_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index 0db888c..148f6e3 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, XIP)) { > + if (test_opt(inode->i_sb, DAX)) { > inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_xip_file_operations; > + 
inode->i_fop = &ext2_dax_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > @@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, XIP)) { > + if (test_opt(inode->i_sb, DAX)) { > inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_xip_file_operations; > + inode->i_fop = &ext2_dax_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index fdcacf7..8062373 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -288,7 +288,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) > #endif > > #ifdef CONFIG_FS_DAX > - if (sbi->s_mount_opt & EXT2_MOUNT_XIP) > + if (sbi->s_mount_opt & EXT2_MOUNT_DAX) > seq_puts(seq, ",xip"); > #endif > > @@ -393,7 +393,7 @@ enum { > Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, > Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug, > Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr, > - Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota, > + Opt_acl, Opt_noacl, Opt_dax, Opt_ignore, Opt_err, Opt_quota, > Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation > }; > > @@ -421,7 +421,7 @@ static const match_table_t tokens = { > {Opt_nouser_xattr, "nouser_xattr"}, > {Opt_acl, "acl"}, > {Opt_noacl, "noacl"}, > - {Opt_xip, "xip"}, > + {Opt_dax, "xip"}, > {Opt_grpquota, "grpquota"}, > {Opt_ignore, "noquota"}, > {Opt_quota, "quota"}, > @@ -548,9 +548,9 @@ static int parse_options(char *options, struct super_block *sb) > "(no)acl options not supported"); > break; > #endif > - case Opt_xip: > + case Opt_dax: > #ifdef CONFIG_FS_DAX > - set_opt (sbi->s_mount_opt, XIP); > + set_opt 
(sbi->s_mount_opt, DAX); > #else > ext2_msg(sb, KERN_INFO, "xip option not supported"); > #endif > @@ -896,7 +896,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) > > blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); > > - if (sbi->s_mount_opt & EXT2_MOUNT_XIP) { > + if (sbi->s_mount_opt & EXT2_MOUNT_DAX) { > if (blocksize != PAGE_SIZE) { > ext2_msg(sb, KERN_ERR, > "error: unsupported blocksize for xip"); > @@ -1275,10 +1275,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) > ((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0); > > es = sbi->s_es; > - if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { > + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) { > ext2_msg(sb, KERN_WARNING, "warning: refusing change of " > "xip flag with busy inodes while remounting"); > - sbi->s_mount_opt ^= EXT2_MOUNT_XIP; > + sbi->s_mount_opt ^= EXT2_MOUNT_DAX; > } > if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) { > spin_unlock(&sbi->s_lock); > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 22/22] brd: Rename XIP to DAX Date: Wed, 9 Apr 2014 12:07:09 +0200 Message-ID: <20140409100709.GK32103@quack.suse.cz> References: <7fd74703525f4077ed7c2b273ce6d082b03f0b61.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Matthew Wilcox To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <7fd74703525f4077ed7c2b273ce6d082b03f0b61.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:48, Matthew Wilcox wrote: > From: Matthew Wilcox > > Since this is relating to FS_XIP, not KERNEL_XIP, it should be called > DAX instead of XIP. > > Signed-off-by: Matthew Wilcox Looks good. You can add: Reviewed-by: Jan Kara Honza > --- > drivers/block/Kconfig | 13 +++++++------ > drivers/block/brd.c | 14 +++++++------- > fs/Kconfig | 4 ++-- > 3 files changed, 16 insertions(+), 15 deletions(-) > > diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig > index 014a1cf..1b8094d 100644 > --- a/drivers/block/Kconfig > +++ b/drivers/block/Kconfig > @@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE > The default value is 4096 kilobytes. Only change this if you know > what you are doing. > > -config BLK_DEV_XIP > - bool "Support XIP filesystems on RAM block device" > - depends on BLK_DEV_RAM > +config BLK_DEV_RAM_DAX > + bool "Support Direct Access (DAX) to RAM block devices" > + depends on BLK_DEV_RAM && FS_DAX > default n > help > - Support XIP filesystems (such as ext2 with XIP support on) on > - top of block ram device. This will slightly enlarge the kernel, and > - will prevent RAM block device backing store memory from being > + Support filesystems using DAX to access RAM block devices. 
This > + avoids double-buffering data in the page cache before copying it > + to the block device. Answering Y will slightly enlarge the kernel, > + and will prevent RAM block device backing store memory from being > allocated from highmem (only a problem for highmem systems). > > config CDROM_PKTCDVD > diff --git a/drivers/block/brd.c b/drivers/block/brd.c > index 00da60d..619e0e0 100644 > --- a/drivers/block/brd.c > +++ b/drivers/block/brd.c > @@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) > * Must use NOIO because we don't want to recurse back into the > * block or filesystem layers from page reclaim. > * > - * Cannot support XIP and highmem, because our ->direct_access > - * routine for XIP must return memory that is always addressable. > - * If XIP was reworked to use pfns and kmap throughout, this > + * Cannot support DAX and highmem, because our ->direct_access > + * routine for DAX must return memory that is always addressable. > + * If DAX was reworked to use pfns and kmap throughout, this > * restriction might be able to be lifted. 
> */ > gfp_flags = GFP_NOIO | __GFP_ZERO; > -#ifndef CONFIG_BLK_DEV_XIP > +#ifndef CONFIG_BLK_DEV_RAM_DAX > gfp_flags |= __GFP_HIGHMEM; > #endif > page = alloc_page(gfp_flags); > @@ -360,7 +360,7 @@ out: > bio_endio(bio, err); > } > > -#ifdef CONFIG_BLK_DEV_XIP > +#ifdef CONFIG_BLK_DEV_RAM_DAX > static long brd_direct_access(struct block_device *bdev, sector_t sector, > void **kaddr, unsigned long *pfn, long size) > { > @@ -383,6 +383,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector, > * file is mapped to the next page of physical RAM */ > return PAGE_SIZE; > } > +#else > +#define brd_direct_access NULL > #endif > > static int brd_ioctl(struct block_device *bdev, fmode_t mode, > @@ -422,9 +424,7 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode, > static const struct block_device_operations brd_fops = { > .owner = THIS_MODULE, > .ioctl = brd_ioctl, > -#ifdef CONFIG_BLK_DEV_XIP > .direct_access = brd_direct_access, > -#endif > }; > > /* > diff --git a/fs/Kconfig b/fs/Kconfig > index 620ab73..376bd0a 100644 > --- a/fs/Kconfig > +++ b/fs/Kconfig > @@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig" > source "fs/nilfs2/Kconfig" > > config FS_DAX > - bool "Direct Access support" > + bool "Direct Access (DAX) support" > depends on MMU > help > Direct Access (DAX) can be used on memory-backed block devices. > @@ -45,7 +45,7 @@ config FS_DAX > > If you do not have a block device that is capable of using this, > or if unsure, say N. Saying Y will increase the size of the kernel > - by about 2kB. > + by about 5kB. > > endif # BLOCK > > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
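The `#else` / `#define brd_direct_access NULL` arrangement in the quoted brd.c hunk lets the operations table be initialized without an `#ifdef` of its own. Below is a minimal user-space sketch of that pattern. Every `mock_` name and the `MOCK_CONFIG_DAX` switch are hypothetical stand-ins for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch standing in for CONFIG_BLK_DEV_RAM_DAX.
 * Uncomment to build the "feature enabled" variant. */
/* #define MOCK_CONFIG_DAX 1 */

/* Hypothetical, cut-down operations table with one optional method. */
struct mock_ops {
	long (*direct_access)(unsigned long sector);
};

#ifdef MOCK_CONFIG_DAX
/* Feature built in: provide a real implementation. */
static long mock_direct_access(unsigned long sector)
{
	(void)sector;
	return 4096;	/* pretend one page is directly addressable */
}
#else
/* Feature compiled out: the name collapses to a null pointer. */
#define mock_direct_access NULL
#endif

/* No #ifdef needed here; the initializer works in both configurations. */
static const struct mock_ops mock_fops = {
	.direct_access = mock_direct_access,
};

/* Callers test the pointer instead of a config symbol. */
static int mock_supports_dax(const struct mock_ops *ops)
{
	assert(ops != NULL);
	return ops->direct_access != NULL;
}
```

With the switch left undefined, `mock_supports_dax(&mock_fops)` reports 0, which mirrors how a caller would see a block device without `->direct_access`.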
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 18/22] xip: Add xip_zero_page_range Date: Wed, 9 Apr 2014 12:15:12 +0200 Message-ID: <20140409101512.GL32103@quack.suse.cz> References: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com, Ross Zwisler To: Matthew Wilcox Return-path: Received: from cantor2.suse.de ([195.135.220.15]:51570 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932712AbaDIKPP (ORCPT ); Wed, 9 Apr 2014 06:15:15 -0400 Content-Disposition: inline In-Reply-To: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Sun 23-03-14 15:08:44, Matthew Wilcox wrote: > This new function allows us to support hole-punch for XIP files by zeroing > a partial page, as opposed to the xip_truncate_page() function which can > only truncate to the end of the page. Reimplement xip_truncate_page() as > a macro that calls xip_zero_page_range(). > > Signed-off-by: Matthew Wilcox > [ported to 3.13-rc2] > Signed-off-by: Ross Zwisler Two comments below... ... > diff --git a/fs/dax.c b/fs/dax.c > index 45a0a41..2d6b4bc 100644 > --- a/fs/dax.c > +++ b/fs/dax.c ... > @@ -491,11 +494,16 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) > if (buffer_written(&bh)) { > void *addr; > err = dax_get_addr(inode, &bh, &addr); > - if (err) > + if (err < 0) > return err; > + /* > + * ext4 sometimes asks to zero past the end of a block. It > + * really just wants to zero to the end of the block. > + */ Then we should really fix ext4 I believe... 
> + length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset); > memset(addr + offset, 0, length); > } > > return 0; > } > -EXPORT_SYMBOL_GPL(dax_truncate_page); > +EXPORT_SYMBOL_GPL(dax_zero_page_range); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index bff394d..d0381ab 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2521,6 +2521,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp); > > #ifdef CONFIG_FS_DAX > int dax_clear_blocks(struct inode *, sector_t block, long size); > +int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); > int dax_truncate_page(struct inode *, loff_t from, get_block_t); > ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, > loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags); > @@ -2532,7 +2533,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz) > return 0; > } > > -static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb) > +static inline int dax_zero_page_range(struct inode *inode, loff_t from, > + unsigned len, get_block_t gb) > { > return 0; > } > @@ -2545,6 +2547,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, > } > #endif > > +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */ > +#define dax_truncate_page(inode, from, get_block) \ > + dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block) ^^^^ This should be (PAGE_CACHE_SIZE - (from & (PAGE_CACHE_SIZE - 1))), shouldn't it? 
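To make the arithmetic behind this suggestion concrete, here is a small user-space sketch. It assumes 4 KiB pages, and `bytes_to_page_end()` is a hypothetical helper name that does not appear in the patch; it only shows how far `from` is from the end of its page.

```c
#include <assert.h>

/* Assumption for illustration only: 4 KiB pages. */
#define PAGE_CACHE_SIZE 4096UL

/* Hypothetical helper: number of bytes from 'from' to the end of its page. */
static unsigned long bytes_to_page_end(unsigned long long from)
{
	/* Offset of 'from' within its page. */
	unsigned long offset = (unsigned long)(from & (PAGE_CACHE_SIZE - 1));

	assert(offset < PAGE_CACHE_SIZE);
	return PAGE_CACHE_SIZE - offset;
}
```

With `from = 100` the helper yields 3996, so a caller passing `PAGE_CACHE_SIZE` as the length asks for 100 bytes more than remain in the page. The `min_t()` clamp quoted earlier in this message is what currently absorbs that excess.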
> + > + > #ifdef CONFIG_BLOCK > typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode, Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Wed, 9 Apr 2014 12:27:58 +0200 Message-ID: <20140409102758.GM32103@quack.suse.cz> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org One more comment: On Sun 23-03-14 15:08:33, Matthew Wilcox wrote: > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ > + struct file *file = vma->vm_file; > + struct inode *inode = file_inode(file); > + struct address_space *mapping = file->f_mapping; > + struct page *page; > + struct buffer_head bh; > + unsigned long vaddr = (unsigned long)vmf->virtual_address; > + sector_t block; > + pgoff_t size; > + unsigned long pfn; > + int error; > + int major = 0; > + > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (vmf->pgoff >= size) > + return VM_FAULT_SIGBUS; > + > + memset(&bh, 0, sizeof(bh)); > + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits); > + bh.b_size = PAGE_SIZE; > + > + repeat: > + page = find_get_page(mapping, vmf->pgoff); > + if (page) { > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { > + page_cache_release(page); > + return VM_FAULT_RETRY; > + } > + if (unlikely(page->mapping != mapping)) { > + unlock_page(page); > + page_cache_release(page); > + goto repeat; > + } > + } > + > + error = get_block(inode, block, &bh, 0); > + if (error || bh.b_size < PAGE_SIZE) > + goto sigbus; > + > + if (!buffer_written(&bh) && !vmf->cow_page) { > + if (vmf->flags & 
FAULT_FLAG_WRITE) { > + error = get_block(inode, block, &bh, 1); > + count_vm_event(PGMAJFAULT); > + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); > + major = VM_FAULT_MAJOR; > + if (error || bh.b_size < PAGE_SIZE) > + goto sigbus; > + } else { > + return dax_load_hole(mapping, page, vmf); > + } > + } > + > + /* Recheck i_size under i_mmap_mutex */ > + mutex_lock(&mapping->i_mmap_mutex); > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (unlikely(vmf->pgoff >= size)) { > + mutex_unlock(&mapping->i_mmap_mutex); > + goto sigbus; You need to release the block you've got from the filesystem in case of error here an below. Honza > + } > + if (vmf->cow_page) { > + if (buffer_written(&bh)) > + copy_user_bh(vmf->cow_page, inode, &bh, vaddr); > + else > + clear_user_highpage(vmf->cow_page, vaddr); > + if (page) { > + unlock_page(page); > + page_cache_release(page); > + } > + /* do_cow_fault() will release the i_mmap_mutex */ > + return VM_FAULT_COWED; > + } > + > + if (buffer_unwritten(&bh) || buffer_new(&bh)) > + dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); > + > + error = dax_get_pfn(inode, &bh, &pfn); > + if (error > 0) > + error = vm_insert_mixed(vma, vaddr, pfn); > + mutex_unlock(&mapping->i_mmap_mutex); > + > + if (page) { > + delete_from_page_cache(page); > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > + PAGE_CACHE_SIZE, 0); > + unlock_page(page); > + page_cache_release(page); > + } > + > + if (error == -ENOMEM) > + return VM_FAULT_OOM; > + /* -EBUSY is fine, somebody else faulted on the same PTE */ > + if (error != -EBUSY) > + BUG_ON(error); > + return VM_FAULT_NOPAGE | major; > + > + sigbus: > + if (page) { > + unlock_page(page); > + page_cache_release(page); > + } > + return VM_FAULT_SIGBUS; > +} -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O
Date: Wed, 9 Apr 2014 14:04:37 +0200
Message-ID: <20140409120437.GA7715@quack.suse.cz>
References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

  I've noticed one more thing here:

On Sun 23-03-14 15:08:32, Matthew Wilcox wrote:
....
> +ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
> +		const struct iovec *iov, loff_t offset, unsigned nr_segs,
> +		get_block_t get_block, dio_iodone_t end_io, int flags)
> +{
...
> +	retval = dax_io(rw, inode, iov, offset, end, get_block, &bh);
> +
> +	if ((flags & DIO_LOCKING) && (rw == READ))
> +		mutex_unlock(&inode->i_mutex);
> +
> +	inode_dio_done(inode);
> +
> +	if ((retval > 0) && end_io)
> +		end_io(iocb, offset, retval, bh.b_private);
  In direct IO code, we first call end_io() callback and do
inode_dio_done() only after that. Since filesystems use i_dio_count for
protecting against different races, calling end_io() after inode_dio_done()
can open all sorts of subtle races.

								Honza
-- 
Jan Kara
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
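The ordering concern raised in the message above (run the `end_io()` callback while the in-flight counter is still held, and drop the counter only afterwards) can be illustrated with a user-space mock. This is a sketch under stated assumptions, not the kernel code: `struct mock_inode`, `mock_end_io()` and `mock_complete_io()` are hypothetical names, and `dio_count` merely models `i_dio_count`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode state relevant to the ordering. */
struct mock_inode {
	int dio_count;		/* models i_dio_count (in-flight direct I/O) */
	int end_io_saw_count;	/* counter value observed inside end_io */
};

typedef void (*mock_end_io_t)(struct mock_inode *inode, long bytes);

/* Hypothetical completion callback: records whether I/O still counts as
 * in flight at the moment the callback runs. */
static void mock_end_io(struct mock_inode *inode, long bytes)
{
	(void)bytes;
	inode->end_io_saw_count = inode->dio_count;
}

/* Completion in the order the review asks for: callback first, then the
 * decrement that models inode_dio_done(). */
static long mock_complete_io(struct mock_inode *inode, long retval,
			     mock_end_io_t end_io)
{
	assert(inode->dio_count > 0);

	if (retval > 0 && end_io)
		end_io(inode, retval);

	inode->dio_count--;	/* models inode_dio_done() */
	return retval;
}
```

In this mock the callback observes a non-zero counter, so anything waiting for the counter to reach zero cannot proceed until the callback has finished. Swapping the two steps would let such a waiter run concurrently with the callback.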
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 20/22] ext4: Add DAX functionality Date: Wed, 9 Apr 2014 14:17:17 +0200 Message-ID: <20140409121717.GN32103@quack.suse.cz> References: <490bf3041f0e0633964ca84bf4fb0bb3dd999694.1395591795.git.matthew.r.wilcox@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ross Zwisler , willy@linux.intel.com To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <490bf3041f0e0633964ca84bf4fb0bb3dd999694.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 23-03-14 15:08:46, Matthew Wilcox wrote: > From: Ross Zwisler > > This is a port of the DAX functionality found in the current version of > ext2. > > Signed-off-by: Ross Zwisler > Reviewed-by: Andreas Dilger > [heavily tweaked] > Signed-off-by: Matthew Wilcox > --- I have some comments below. > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 1a50739..42a8ccd 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, > } > } > > - if (unlikely(iocb->ki_filp->f_flags & O_DIRECT)) > + if (io_is_direct(iocb->ki_filp)) > ret = ext4_file_dio_write(iocb, iov, nr_segs, pos); > else > ret = generic_file_aio_write(iocb, iov, nr_segs, pos); > @@ -198,6 +198,27 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, > return ret; > } > > +#ifdef CONFIG_FS_DAX > +static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return dax_fault(vma, vmf, ext4_get_block); > + /* Is this the right get_block? */ Yes, it is the right one. > +} > + > +static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return dax_mkwrite(vma, vmf, ext4_get_block); > +} Umm, I'm afraid it won't be this easy here. 
So you rely on ext4_get_block() to start a transaction for you and do the block allocation. However if the system crashes after ext4_get_block() has allocated the block and finished the transaction but before dax_mkwrite() had a chance to zero out the page, the filesystem will be referencing block with uninitialized data when the system boots again (this is a security issue for multiuser systems). What you need to do is to start a transaction here in ext4_dax_mkwrite(), call dax_mkwrite() (ext4_get_block() will notice the transaction is already started and don't start it again so you don't have to care about that), and stop the transaction after dax_mkwrite() returns. Except it's not so easy because sb_start_pagefault() locking ranks above transaction start so ext4 will really need to call into something like do_dax_fault() - I'd suggest we create dax_mkwrite() and __dax_mkwrite() similarly to how block_page_mkwrite() and __block_page_mkwrite() from fs/buffer.c do. > + > +static const struct vm_operations_struct ext4_dax_vm_ops = { > + .fault = ext4_dax_fault, > + .page_mkwrite = ext4_dax_mkwrite, > + .remap_pages = generic_file_remap_pages, > +}; > +#else > +#define ext4_dax_vm_ops ext4_file_vm_ops > +#endif > + > static const struct vm_operations_struct ext4_file_vm_ops = { > .fault = filemap_fault, > .page_mkwrite = ext4_page_mkwrite, > @@ -206,12 +227,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = { > > static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) > { > - struct address_space *mapping = file->f_mapping; > - > - if (!mapping->a_ops->readpage) > - return -ENOEXEC; > file_accessed(file); > - vma->vm_ops = &ext4_file_vm_ops; > + if (IS_DAX(file_inode(file))) { > + vma->vm_ops = &ext4_dax_vm_ops; > + vma->vm_flags |= VM_MIXEDMAP; > + } else { > + vma->vm_ops = &ext4_file_vm_ops; > + } > return 0; > } > > @@ -609,6 +631,25 @@ const struct file_operations ext4_file_operations = { > .fallocate = ext4_fallocate, > }; > > 
+#ifdef CONFIG_FS_DAX > +const struct file_operations ext4_dax_file_operations = { > + .llseek = ext4_llseek, > + .read = do_sync_read, > + .write = do_sync_write, > + .aio_read = generic_file_aio_read, > + .aio_write = ext4_file_write, > + .unlocked_ioctl = ext4_ioctl, > +#ifdef CONFIG_COMPAT > + .compat_ioctl = ext4_compat_ioctl, > +#endif > + .mmap = ext4_file_mmap, > + .open = ext4_file_open, > + .release = ext4_release_file, > + .fsync = ext4_sync_file, > + .fallocate = ext4_fallocate, > +}; > +#endif > + > const struct inode_operations ext4_file_inode_operations = { > .setattr = ext4_setattr, > .getattr = ext4_getattr, > diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c > index 594009f..5fdb414 100644 > --- a/fs/ext4/indirect.c > +++ b/fs/ext4/indirect.c > @@ -686,15 +686,22 @@ retry: > inode_dio_done(inode); > goto locked; > } > - ret = __blockdev_direct_IO(rw, iocb, inode, > - inode->i_sb->s_bdev, iov, > - offset, nr_segs, > - ext4_get_block, NULL, NULL, 0); > + if (IS_DAX(inode)) > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > + ext4_get_block, NULL, 0); > + else > + ret = __blockdev_direct_IO(rw, iocb, inode, > + inode->i_sb->s_bdev, iov, offset, > + nr_segs, ext4_get_block, NULL, NULL, 0); > inode_dio_done(inode); > } else { > locked: > - ret = blockdev_direct_IO(rw, iocb, inode, iov, > - offset, nr_segs, ext4_get_block); > + if (IS_DAX(inode)) > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > + ext4_get_block, NULL, DIO_LOCKING); > + else > + ret = blockdev_direct_IO(rw, iocb, inode, iov, > + offset, nr_segs, ext4_get_block); > > if (unlikely((rw & WRITE) && ret < 0)) { > loff_t isize = i_size_read(inode); > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index ce7341c..9462730 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3140,13 +3140,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb, > get_block_func = ext4_get_block_write; > dio_flags = DIO_LOCKING; > } > - ret = __blockdev_direct_IO(rw, 
iocb, inode, > - inode->i_sb->s_bdev, iov, > - offset, nr_segs, > - get_block_func, > - ext4_end_io_dio, > - NULL, > - dio_flags); > + if (IS_DAX(inode)) > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > + get_block_func, ext4_end_io_dio, dio_flags); > + else > + ret = __blockdev_direct_IO(rw, iocb, inode, > + inode->i_sb->s_bdev, iov, offset, > + nr_segs, get_block_func, > + ext4_end_io_dio, NULL, dio_flags); > Since you don't do real AIO for DAX, you could handle async iocbs for DAX inodes the same way as normal sync iocbs (i.e., you don't need to allocate ioend and do completion from a workqueue but handle everything necessary in ext4_ext_direct_IO()). That will be noticeably faster and with smaller CPU load as well. I'm not saying you have to do that now (although it shouldn't be complicated) but at least note that in a comment please. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Date: Wed, 9 Apr 2014 11:19:08 -0400 Message-ID: <20140409151908.GD5727@linux.intel.com> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> <20140408202102.GB5727@linux.intel.com> <20140409091450.GA32103@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140409091450.GA32103@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Apr 09, 2014 at 11:14:50AM +0200, Jan Kara wrote: > On Tue 08-04-14 16:21:02, Matthew Wilcox wrote: > > On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote: > > > > +static void dax_new_buf(void *addr, unsigned size, unsigned first, > > > > + loff_t offset, loff_t end, int rw) > > > > +{ > > > > + loff_t final = end - offset + first; /* The final byte of the buffer */ > > > > + if (rw != WRITE) { > > > > + memset(addr, 0, size); > > > > + return; > > > > + } > > > It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do > > > this for unwritten blocks) when reading from them. Presumably it could also > > > have undesired effects on endurance of persistent memory. Instead I'd expect > > > that you simply zero out user provided buffer the same way as you do it for > > > holes. > > > > I think we have to zero it here, because the second time we call > > get_block() for a given block, it won't be BH_New any more, so we won't > > know that it's supposed to be zeroed. > But how can you have BH_New buffer when you didn't ask get_blocks() to > create any block? That would be a bug in the get_blocks() implementation... > Or am I missing something? Oh ... 
right. So just to be clear, we're looking at the case where we're doing a read of a filesystem block which is BH_Unwritten, but isn't a hole ... so it's been allocated on storage and not yet written. That's already treated as a hole: if (rw == WRITE) { ... } else { hole = !buffer_written(bh); } and dax_new_buf is only called in the !hole case. > OK, but there are filesystems which do the same thing as ext4 (e.g. > btrfs) and historically noone really cared. E.g. direct IO code advances > only by a single block regardless of what filesystem returns when the > buffer is unmapped. As you correctly mention, get_blocks() API isn't really > documented so noone has really defined what should happen when you ask > filesystem to map some blocks and there's a hole. I agree what XFS does > looks sensible and ext4 can do the same. Hopefully this gets cleaned up > when Dave finishes his new block mapping interface. I hope so too! The get_block() API has been the bane of my existance since Christmas :-) > This wouldn't quite work because even ext4_map_blocks() doesn't bother to > fill in 'map' when it finds a hole. But it won't be complicated to > propagate the information. Good point. > > It'll be kind of tricky to move it because 'len' is not necessarily > > a multiple of i_blkbits, so we can't necessarily maintain b_blocknr > > accurately. > Yeah, after I understood the code I also understood why you do it the way > you did. But we could do something like: > ... > + if (!len) > + break; > + > blocks = ((offset + len) >> inode->i_blkbits) - > (offset >> inode->i_blkbits); > bh->b_blocknr += blocks; > bh->b_size -= blocks << inode->i_blkbits; > + offset += len; > + copied += len; > + addr += len; > ... We could ... I'm not sure it's simpler though. > BTW: it might be good to store inode->i_blkbits in a local variable. It > makes some expressions shorter. Yes, good idea. Done. 
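The block-advance arithmetic quoted above, with `i_blkbits` held in a local variable and `pos` used for the file position as suggested in this exchange, can be tried out in user space. This is only a sketch: `struct mock_bh` and `mock_advance()` are hypothetical names, and the struct carries just the two fields the quoted snippet touches.

```c
#include <assert.h>

/* Hypothetical, cut-down stand-in for the buffer_head fields used above. */
struct mock_bh {
	unsigned long long b_blocknr;	/* first block of the mapping */
	unsigned long long b_size;	/* bytes left in the mapping */
};

/*
 * Advance by 'len' bytes starting at file position 'pos'.
 * b_blocknr moves forward by the number of block boundaries crossed in
 * [pos, pos + len), and b_size shrinks by the same number of blocks.
 * Returns the new file position.
 */
static unsigned long long mock_advance(struct mock_bh *bh,
				       unsigned long long pos,
				       unsigned long long len,
				       unsigned blkbits)
{
	unsigned long long blocks;

	blocks = ((pos + len) >> blkbits) - (pos >> blkbits);
	assert((blocks << blkbits) <= bh->b_size);

	bh->b_blocknr += blocks;
	bh->b_size -= blocks << blkbits;
	return pos + len;
}
```

For example, with 4096-byte blocks (`blkbits = 12`), copying 3000 bytes from position 1000 stays inside block 0 and leaves `b_blocknr` untouched, while a further 200 bytes from position 4000 crosses into the next block and advances `b_blocknr` by one. This is the case the thread refers to when noting that `len` is not necessarily a multiple of the block size.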
> BTW2: although direct IO uses 'offset' for position in file, the rest of > VFS uses 'pos' for that and that seems to be less overloaded term so for me > it would be easier if you used 'pos' instead of 'offset'. Just a > suggestion. Sure. Done. > > > > + if (rw == WRITE) { > > > > + if (!buffer_mapped(bh)) { > > > > + retval = -EIO; > > > > + break; > > > -EIO looks like a wrong error here. Or maybe it is the right one and it > > > only needs some explanation? The thing is that for direct IO some > > > filesystems choose not to fill holes for direct IO and fall back to > > > buffered IO instead (to avoid exposure of uninitialized blocks if the > > > system crashes after blocks have been added to a file but before they were > > > written out). For DAX you are pretty much free to define what you ask from > > > the get_blocks() (and this fallback behavior is somewhat disputed behavior > > > in direct IO case so you might want to differ here) but you should document > > > it somewhere. > > > > Hmm ... I thought that calling get_block() with the create argument would > > force the return of a bh with the Mapped bit set. Did I misunderstand that > > aspect of the undocumented get_block() API too? > As you mention the API is undocumented and not really designed. So > filesystems do whatever causes the generic code to do what they want (it's > a mess I know). In this case, I'm warning you there are filesystems which > refuse to fill in holes from the get_blocks() function passed to > blockdev_direct_IO() (even ext4 does this for inodes with old > indirect-block based on disk format). You can just define DAX fails > horribly in these case and I'm fine with that at least in this stage. If > someone bothers later, fallback to buffered IO can be implemented. But we > should document this somewhere. Urgh. Yeah, we should probably fall back to buffered I/O for that case. I'll stick a comment in dax.c for now, and we can fix it later. 
> > > > + if ((flags & DIO_LOCKING) && (rw == READ)) { > > > > + struct address_space *mapping = inode->i_mapping; > > > > + mutex_lock(&inode->i_mutex); > > > > + retval = filemap_write_and_wait_range(mapping, offset, end - 1); > > > > + if (retval) { > > > > + mutex_unlock(&inode->i_mutex); > > > > + goto out; > > > > + } > > > Is there a reason for this? I'd assume DAX has no pages in pagecache... > > > > There will be pages in the page cache for holes that we page faulted on. > > They must go! :-) > Well, but this will only writeback dirty pages and if I read the code > correctly those pages will never be dirty since dax_mkwrite() will replace > them. Or am I missing something? In addition to writing back dirty pages, filemap_write_and_wait_range() will evict clean pages. Unintuitive, I know, but it matches what the direct I/O path does. Plus, if we fall back to buffered I/O for holes (see above), then this will do the right thing at that time. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Wed, 9 Apr 2014 16:48:06 -0400 Message-ID: <20140409204806.GF5727@linux.intel.com> References: <20140408220525.GC26019@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140408220525.GC26019@quack.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Apr 09, 2014 at 12:05:25AM +0200, Jan Kara wrote: > > + if (!page) > > + return VM_FAULT_OOM; > > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > > + if (vmf->pgoff >= size) { > Maybe comment here that we have to recheck i_size so that we don't create > pages in the area truncate_pagecache() has already evicted. Done. > > + dax_get_addr(inode, bh, &vfrom); /* XXX: error handling */ > The error handling here is missing as the comment suggests :) Added. > > + if (buffer_unwritten(&bh) || buffer_new(&bh)) > > + dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); > Where is dax_clear_blocks() defined? Er ... patch 11. I'll reorder the patches ;-) > > + > > + error = dax_get_pfn(inode, &bh, &pfn); > > + if (error > 0) > > + error = vm_insert_mixed(vma, vaddr, pfn); > When there's a hole (thus page != NULL) and we are called from > dax_mkwrite(), this will always return EBUSY, correct? Erm ... it will return -EBUSY if this was the task that previously faulted on it. Drat. See below. 
> > + mutex_unlock(&mapping->i_mmap_mutex); > > + > > + if (page) { > > + delete_from_page_cache(page); > > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > > + PAGE_CACHE_SIZE, 0); > Here we unmap the PTE pointing to the hole page but then we'll have to > retry the fault again to fill in the pfn we've got? This seems wrong. I'd > say we want to remap the PTE from the hole page to a pfn we've got while > holding i_mmap_mutex. remap_pfn_range() almost does what you need, except > that you also need that to work for normal pages. So you might need to > create a new helper in mm layer for that. I think it's easier than that. How does this look? @@ -390,9 +389,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); error = dax_get_pfn(&bh, &pfn, blkbits); - if (error > 0) - error = vm_insert_mixed(vma, vaddr, pfn); - mutex_unlock(&mapping->i_mmap_mutex); + if (error <= 0) + goto unlock; if (page) { delete_from_page_cache(page); @@ -402,6 +400,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v page_cache_release(page); } + error = vm_insert_mixed(vma, vaddr, pfn); + mutex_unlock(&mapping->i_mmap_mutex); + if (error == -ENOMEM) return VM_FAULT_OOM; /* -EBUSY is fine, somebody else faulted on the same PTE */ @@ -409,6 +410,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v BUG_ON(error); return VM_FAULT_NOPAGE | major; + unlock: + mutex_unlock(&mapping->i_mmap_mutex); sigbus: if (page) { unlock_page(page); > > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > + get_block_t get_block) > > +{ > > + int result; > > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > + > > + sb_start_pagefault(sb); > You don't need any filesystem freeze protection for the fault handler > since that's not going to modify the filesystem. Err ... we might allocate a block as a result of doing a write to a hole. 
Or does that not count as 'modifying the filesystem' in this context? > > + file_update_time(vma->vm_file); > Why do you update m/ctime? We are only reading the file... ... except that it might be a write fault. I think we modify the file iff we return VM_FAULT_MAJOR from do_dax_fault(). So I'd be open to something like this: sb_start_pagefault(sb); result = do_dax_fault(vma, vmf, get_block); if (result & VM_FAULT_MAJOR) file_update_time(vma->vm_file); sb_end_pagefault(sb); Would that work better for you? > > @@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = { > > #ifdef CONFIG_COMPAT > > .compat_ioctl = ext2_compat_ioctl, > > #endif > > - .mmap = generic_file_mmap, > > + .mmap = ext2_file_mmap, > So what's the point of ext2_file_operations ever handling IS_DAX() > inodes? Actually ext2_file_operations and ext2_xip_file_operations seem to > be the same after this patch so either you drop ext2_xip_file_operations > (I'm for this) or you can leave generic_file_mmap here and assume > ext2_file_mmap is always called for IS_DAX() inodes. The goal is to get them the same. At this point, the only sticky point is: .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, And splice is pretty damn sticky for DAX. 
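The timestamp proposal above can be modelled in userspace. This is only a sketch: the MODEL_* flags, struct model_file and both functions are hypothetical stand-ins, and it assumes (as the mail suggests, without confirming) that the fault handler reports MAJOR exactly when it modified the file by allocating a block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's fault flags; values are illustrative. */
#define MODEL_FAULT_MAJOR  0x1u
#define MODEL_FAULT_NOPAGE 0x2u

struct model_file {
	unsigned long mtime_updates;	/* how often the timestamp was touched */
};

/* Models do_dax_fault(): assumed to report MAJOR only for a write to a hole. */
static unsigned model_do_fault(bool write_to_hole)
{
	unsigned result = MODEL_FAULT_NOPAGE;

	if (write_to_hole)
		result |= MODEL_FAULT_MAJOR;
	return result;
}

/* Models the proposed wrapper: update the timestamp only on a MAJOR fault. */
static unsigned model_fault(struct model_file *f, bool write_to_hole)
{
	unsigned result;

	assert(f != NULL);
	result = model_do_fault(write_to_hole);
	if (result & MODEL_FAULT_MAJOR)
		f->mtime_updates++;
	return result;
}
```

In this model a read fault leaves mtime_updates untouched, while a write fault that fills a hole bumps it once.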
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Date: Wed, 9 Apr 2014 16:51:11 -0400
Message-ID: <20140409205111.GG5727@linux.intel.com>
References: <20140409102758.GM32103@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Jan Kara
Return-path:
Content-Disposition: inline
In-Reply-To: <20140409102758.GM32103@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, Apr 09, 2014 at 12:27:58PM +0200, Jan Kara wrote:
> > +	if (unlikely(vmf->pgoff >= size)) {
> > +		mutex_unlock(&mapping->i_mmap_mutex);
> > +		goto sigbus;
>   You need to release the block you've got from the filesystem in case of
> error here and below.

What's the API to do that?  Call inode->i_op->setattr()?
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Date: Wed, 9 Apr 2014 22:55:29 +0200 Message-ID: <20140409205529.GO32103@quack.suse.cz> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> <20140408202102.GB5727@linux.intel.com> <20140409091450.GA32103@quack.suse.cz> <20140409151908.GD5727@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140409151908.GD5727@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed 09-04-14 11:19:08, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:14:50AM +0200, Jan Kara wrote: > > On Tue 08-04-14 16:21:02, Matthew Wilcox wrote: > > > On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote: > > > > > +static void dax_new_buf(void *addr, unsigned size, unsigned first, > > > > > + loff_t offset, loff_t end, int rw) > > > > > +{ > > > > > + loff_t final = end - offset + first; /* The final byte of the buffer */ > > > > > + if (rw != WRITE) { > > > > > + memset(addr, 0, size); > > > > > + return; > > > > > + } > > > > It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do > > > > this for unwritten blocks) when reading from them. Presumably it could also > > > > have undesired effects on endurance of persistent memory. Instead I'd expect > > > > that you simply zero out user provided buffer the same way as you do it for > > > > holes. > > > > > > I think we have to zero it here, because the second time we call > > > get_block() for a given block, it won't be BH_New any more, so we won't > > > know that it's supposed to be zeroed. 
> > But how can you have BH_New buffer when you didn't ask get_blocks() to > > create any block? That would be a bug in the get_blocks() implementation... > > Or am I missing something? > > Oh ... right. So just to be clear, we're looking at the case where > we're doing a read of a filesystem block which is BH_Unwritten, but > isn't a hole ... so it's been allocated on storage and not yet written. > That's already treated as a hole: > > if (rw == WRITE) { > ... > } else { > hole = !buffer_written(bh); > } > > and dax_new_buf is only called in the !hole case. Ah, my bad. But then dax_new_buf() won't ever be called for rw != WRITE. get_blocks() cannot ever return BH_New buffer when 'create' argument was 0. > > > > > + if ((flags & DIO_LOCKING) && (rw == READ)) { > > > > > + struct address_space *mapping = inode->i_mapping; > > > > > + mutex_lock(&inode->i_mutex); > > > > > + retval = filemap_write_and_wait_range(mapping, offset, end - 1); > > > > > + if (retval) { > > > > > + mutex_unlock(&inode->i_mutex); > > > > > + goto out; > > > > > + } > > > > Is there a reason for this? I'd assume DAX has no pages in pagecache... > > > > > > There will be pages in the page cache for holes that we page faulted on. > > > They must go! :-) > > Well, but this will only writeback dirty pages and if I read the code > > correctly those pages will never be dirty since dax_mkwrite() will replace > > them. Or am I missing something? > > In addition to writing back dirty pages, filemap_write_and_wait_range() > will evict clean pages. Unintuitive, I know, but it matches what the > direct I/O path does. Plus, if we fall back to buffered I/O for holes > (see above), then this will do the right thing at that time. Ugh, I'm pretty certain filemap_write_and_wait_range() doesn't evict anything ;). Direct IO path calls that function so that direct IO read after buffered write returns the written data. 
In that case we don't evict anything from page cache because direct IO read doesn't invalidate any information we have cached. Only direct IO write does that and for that we call invalidate_inode_pages2_range() after writing the pages. So I maintain that what you do doesn't make sense to me. You might need to do some invalidation of hole pages. But note that generic_file_direct_write() does that for you and even though that isn't serialized in any way with page faults which can instantiate the hole pages again, things should work out fine for you since that function also invalidates the range again after ->direct_IO callback is done. So AFAICT you don't have to do anything except writing some nice comment about this ;). Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Wed, 9 Apr 2014 23:12:03 +0200 Message-ID: <20140409211203.GP32103@quack.suse.cz> References: <20140408220525.GC26019@quack.suse.cz> <20140409204806.GF5727@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140409204806.GF5727@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed 09-04-14 16:48:06, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 12:05:25AM +0200, Jan Kara wrote: > > > + > > > + error = dax_get_pfn(inode, &bh, &pfn); > > > + if (error > 0) > > > + error = vm_insert_mixed(vma, vaddr, pfn); > > When there's a hole (thus page != NULL) and we are called from > > dax_mkwrite(), this will always return EBUSY, correct? 
> > Erm ... it will return -EBUSY if this was the task that previously > faulted on it. Drat. See below. > > > > + mutex_unlock(&mapping->i_mmap_mutex); > > > + > > > + if (page) { > > > + delete_from_page_cache(page); > > > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > > > + PAGE_CACHE_SIZE, 0); > > Here we unmap the PTE pointing to the hole page but then we'll have to > > retry the fault again to fill in the pfn we've got? This seems wrong. I'd > > say we want to remap the PTE from the hole page to a pfn we've got while > > holding i_mmap_mutex. remap_pfn_range() almost does what you need, except > > that you also need that to work for normal pages. So you might need to > > create a new helper in mm layer for that. > > I think it's easier than that. How does this look? > > @@ -390,9 +389,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v > dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); > > error = dax_get_pfn(&bh, &pfn, blkbits); > - if (error > 0) > - error = vm_insert_mixed(vma, vaddr, pfn); > - mutex_unlock(&mapping->i_mmap_mutex); > + if (error <= 0) > + goto unlock; > > if (page) { > delete_from_page_cache(page); > @@ -402,6 +400,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v > page_cache_release(page); > } > > + error = vm_insert_mixed(vma, vaddr, pfn); > + mutex_unlock(&mapping->i_mmap_mutex); > + This would be fine except that unmap_mapping_range() grabs i_mmap_mutex again :-|. But it might be easier to provide a version of that function which assumes i_mmap_mutex is already locked than what I was suggesting. 
> if (error == -ENOMEM) > return VM_FAULT_OOM; > /* -EBUSY is fine, somebody else faulted on the same PTE */ > @@ -409,6 +410,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v > BUG_ON(error); > return VM_FAULT_NOPAGE | major; > > + unlock: > + mutex_unlock(&mapping->i_mmap_mutex); > sigbus: > if (page) { > unlock_page(page); > > > > > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > > + get_block_t get_block) > > > +{ > > > + int result; > > > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > > + > > > + sb_start_pagefault(sb); > > You don't need any filesystem freeze protection for the fault handler > > since that's not going to modify the filesystem. > > Err ... we might allocate a block as a result of doing a write to a hole. > Or does that not count as 'modifying the filesystem' in this context? Ah, it does. But it would be nice to avoid doing sb_start_pagefault() if it's not a write fault - because you don't want to block reading from a frozen filesystem (imagine what would happen when you freeze your root filesystem to do a snapshot...). I have somewhat a mindset of standard pagecache mmap where filemap_fault() only reads in data regardless of FAULT_FLAG_WRITE setting so I was confused by your difference :). > > > + file_update_time(vma->vm_file); > > Why do you update m/ctime? We are only reading the file... > > ... except that it might be a write fault. I think we modify the file > iff we return VM_FAULT_MAJOR from do_dax_fault(). So I'd be open to > something like this: > > sb_start_pagefault(sb); > result = do_dax_fault(vma, vmf, get_block); > if (result & VM_FAULT_MAJOR) > file_update_time(vma->vm_file); > sb_end_pagefault(sb); > > Would that work better for you? Definitely. 
It's also a performance thing BTW - updating time stamps is relatively expensive for journalling filesystems - you have to start a transaction, add block with inode to the journal, stop a transaction - not something you want to do unless you have to. > > > @@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = { > > > #ifdef CONFIG_COMPAT > > > .compat_ioctl = ext2_compat_ioctl, > > > #endif > > > - .mmap = generic_file_mmap, > > > + .mmap = ext2_file_mmap, > > So what's the point of ext2_file_operations ever handling IS_DAX() > > inodes? Actually ext2_file_operations and ext2_xip_file_operations seem to > > be the same after this patch so either you drop ext2_xip_file_operations > > (I'm for this) or you can leave generic_file_mmap here and assume > > ext2_file_mmap is always called for IS_DAX() inodes. > > The goal is to get them the same. At this point, the only sticky point is: > > .splice_read = generic_file_splice_read, > .splice_write = generic_file_splice_write, > > And splice is pretty damn sticky for DAX. Yes, I have figured that out later. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Wed, 9 Apr 2014 23:43:31 +0200 Message-ID: <20140409214331.GQ32103@quack.suse.cz> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140409205111.GG5727@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed 09-04-14 16:51:11, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 12:27:58PM +0200, Jan Kara wrote: > > > + if (unlikely(vmf->pgoff >= size)) { > > > + mutex_unlock(&mapping->i_mmap_mutex); > > > + goto sigbus; > > You need to release the block you've got from the filesystem in case of > > error here an below. > > What's the API to do that? Call inode->i_op->setattr()? That's a great question. Yes, ->setattr() is the only API you have for that but you cannot use that because of locking constraints (it needs i_mutex and that's not possible to get in the fault path). Let me read again what the handler does... So there are three places that can fail after we allocate the block: 1) We race with truncate reducing i_size 2) dax_get_pfn() fails 3) vm_insert_mixed() fails I would guess that 2) can fail only if the HW has problems and leaking block in that case could be acceptable (please correct me if I'm wrong). 3) shouldn't fail because of ENOMEM because fault has already allocated all the page tables and EBUSY should be handled as well. So the only failure we have to care about is 1). And we could move ->get_block() call under i_mmap_mutex after the i_size check. 
Lock ordering should be fine because i_mmap_mutex ranks above page lock under which we do block mapping in standard ->page_mkwrite callbacks. The only (big) drawback is that i_mmap_mutex will now be held for much longer time and thus the contention would be much higher. But hopefully once we resolve our problems with mmap_sem and introduce mapping range lock we could scale reasonably. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Date: Thu, 10 Apr 2014 10:16:30 -0400 Message-ID: <20140410141630.GH5727@linux.intel.com> References: <20140409094644.GD32103@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140409094644.GD32103@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Apr 09, 2014 at 11:46:44AM +0200, Jan Kara wrote: > Another day, some more review ;) Comments below. I'm really grateful for all this review! It's killing me, though ;-) > > +int dax_clear_blocks(struct inode *inode, sector_t block, long size) > > +{ > > + struct block_device *bdev = inode->i_sb->s_bdev; > > + const struct block_device_operations *ops = bdev->bd_disk->fops; > > + sector_t sector = block << (inode->i_blkbits - 9); > > + unsigned long pfn; > > + > > + might_sleep(); > > + do { > > + void *addr; > > + long count = ops->direct_access(bdev, sector, &addr, &pfn, > > + size); > So do you assume blocksize == PAGE_SIZE here? If not, addr could be in > the middle of the page AFAICT. You're right. 
Depending on how clear_page() is implemented, that might go badly wrong. Of course, both ext2 & ext4 require block_size == PAGE_SIZE right now, so anything else is by definition untested. I've been trying to keep DAX free from that assumption, but obviously haven't caught all the places. How does this look? typedef long (*direct_access_t)(struct block_device *, sector_t, void **, unsigned long *pfn, long size); int dax_clear_blocks(struct inode *inode, sector_t block, long size) { struct block_device *bdev = inode->i_sb->s_bdev; direct_access_t direct_access = bdev->bd_disk->fops->direct_access; sector_t sector = block << (inode->i_blkbits - 9); unsigned long pfn; might_sleep(); do { void *addr; long count = direct_access(bdev, sector, &addr, &pfn, size); if (count < 0) return count; while (count > 0) { unsigned pgsz = PAGE_SIZE - offset_in_page(addr); if (pgsz > count) pgsz = count; if (pgsz < PAGE_SIZE) memset(addr, 0, pgsz); else clear_page(addr); addr += pgsz; size -= pgsz; count -= pgsz; sector += pgsz / 512; cond_resched(); } } while (size); return 0; } EXPORT_SYMBOL_GPL(dax_clear_blocks); > > if (IS_DAX(inode)) { > > /* > > - * we need to clear the block > > + * block must be initialised before we put it in the tree > > + * so that it's not found by another thread before it's > > + * initialised > > */ > > - err = ext2_clear_xip_target (inode, > > - le32_to_cpu(chain[depth-1].key)); > > + err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key), > > + count << inode->i_blkbits); > Umm 'count' looks wrong here. You want to clear only one block, don't > you? I think I got confused between ext2 and ext4 here. I do want to clear only one block. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
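The inner loop of the revised dax_clear_blocks() above can be sketched in userspace as follows. Everything here is an assumption made for illustration: MODEL_PAGE_SIZE, bytes_to_page_end() and zero_range_by_page() are hypothetical names, a plain buffer stands in for the address returned by direct_access(), and memset() is used for every chunk where the kernel version picks clear_page() for whole pages.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Bytes from 'offset' to the end of the page that contains it. */
static size_t bytes_to_page_end(size_t offset)
{
	return MODEL_PAGE_SIZE - (offset & (MODEL_PAGE_SIZE - 1));
}

/*
 * Zero 'count' bytes of 'buf' starting at 'offset', never crossing a page
 * boundary within one chunk.  Returns the number of chunks processed.
 */
static unsigned zero_range_by_page(unsigned char *buf, size_t offset,
				   size_t count)
{
	unsigned chunks = 0;

	assert(buf != NULL);
	while (count > 0) {
		size_t pgsz = bytes_to_page_end(offset);

		if (pgsz > count)
			pgsz = count;
		/* Partial and whole pages are both handled by memset() here. */
		memset(buf + offset, 0, pgsz);
		offset += pgsz;
		count -= pgsz;
		chunks++;
	}
	return chunks;
}
```

Starting 1024 bytes into a page and clearing 8192 bytes gives three chunks (3072, 4096 and 1024 bytes), which is the mid-page case raised in the review.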
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
Date: Thu, 10 Apr 2014 10:23:54 -0400
Message-ID: <20140410142354.GJ5727@linux.intel.com>
References: <20140409095918.GG32103@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Jan Kara
Return-path:
Content-Disposition: inline
In-Reply-To: <20140409095918.GG32103@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, Apr 09, 2014 at 11:59:18AM +0200, Jan Kara wrote:
> On Sun 23-03-14 15:08:41, Matthew Wilcox wrote:
> > The fewer Kconfig options we have the better.  Use the generic
> > CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.
> >
> > Signed-off-by: Matthew Wilcox
>   Looks good. You can add:
> Reviewed-by: Jan Kara
>
> BTW: It's really only 2KB of code?

I changed it in a later patch ... it's about 5kB of code.
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb()
Date: Thu, 10 Apr 2014 10:22:54 -0400
Message-ID: <20140410142254.GI5727@linux.intel.com>
References: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com> <20140409095254.GE32103@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Jan Kara
Return-path:
Received: from mga09.intel.com ([134.134.136.24]:45650 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934688AbaDJOYP (ORCPT ); Thu, 10 Apr 2014 10:24:15 -0400
Content-Disposition: inline
In-Reply-To: <20140409095254.GE32103@quack.suse.cz>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Wed, Apr 09, 2014 at 11:52:54AM +0200, Jan Kara wrote:
> > -	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
> > +	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
> >  		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
> >  			 "xip flag with busy inodes while remounting");
> > -		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
> > -		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
> > +		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
>   Although this is correct, it was easier to see that the previous code is
> correct so I'd prefer if you kept it that way.

Depends how you think about it.  I think of foo ^= bar as 'toggle the
bar bit in foo'.  So I read the code as 'If the mount bit is incorrect,
print an error and toggle the bit'.  I think you're reading the old code
as 'If the new mount bit differs from the old mount bit, make sure the
new mount bit is the same as the old mount bit'.
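The two readings above describe the same result, which a small userspace check can confirm. This is a sketch, not the ext2 code: MODEL_MOUNT_XIP and its bit value are made up for illustration, and the helper names are hypothetical.

```c
#include <assert.h>

#define MODEL_MOUNT_XIP 0x20u	/* illustrative bit, not the real ext2 value */

/* Old form: clear the flag, then copy its state from the old options. */
static unsigned restore_by_mask(unsigned new_opt, unsigned old_opt)
{
	if ((new_opt ^ old_opt) & MODEL_MOUNT_XIP) {
		new_opt &= ~MODEL_MOUNT_XIP;
		new_opt |= old_opt & MODEL_MOUNT_XIP;
	}
	return new_opt;
}

/* New form: the flag is known to differ, so toggling it restores the old state. */
static unsigned restore_by_toggle(unsigned new_opt, unsigned old_opt)
{
	if ((new_opt ^ old_opt) & MODEL_MOUNT_XIP)
		new_opt ^= MODEL_MOUNT_XIP;
	return new_opt;
}

/* Compare both forms over every pair of 8-bit option words. */
static void check_forms_agree(void)
{
	for (unsigned n = 0; n < 256; n++)
		for (unsigned o = 0; o < 256; o++)
			assert(restore_by_mask(n, o) == restore_by_toggle(n, o));
}
```

The toggle is only safe because it sits inside the test that the bit differs; outside that test, ^= would flip a bit that was already correct.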
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 17/22] Get rid of most mentions of XIP in ext2
Date: Thu, 10 Apr 2014 10:26:25 -0400
Message-ID: <20140410142625.GK5727@linux.intel.com>
References: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> <20140409100435.GJ32103@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Jan Kara
Return-path:
Received: from mga03.intel.com ([143.182.124.21]:49815 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751225AbaDJO0a (ORCPT ); Thu, 10 Apr 2014 10:26:30 -0400
Content-Disposition: inline
In-Reply-To: <20140409100435.GJ32103@quack.suse.cz>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Wed, Apr 09, 2014 at 12:04:35PM +0200, Jan Kara wrote:
> On Sun 23-03-14 15:08:43, Matthew Wilcox wrote:
> > The only remaining usage is userspace's 'xip' option.
>   Looks good. You can add:
> Reviewed-by: Jan Kara

I've been thinking about this patch, and I'm not happy with it any more :-)

I want to migrate people away from using 'xip' to 'dax' without breaking
anybody's scripts.  So I'm thinking about adding a new 'dax' option and
having the 'xip' option print a warning and force-enable the 'dax' option.
That way people who might have scripts to look for 'xip' in /proc/mounts
won't break.
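The migration idea above might look roughly like this in a userspace model. It is a sketch under stated assumptions: MODEL_MOUNT_DAX, parse_mount_option() and the warning text are all hypothetical, and real ext2 option parsing is structured differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MOUNT_DAX 0x1u	/* hypothetical flag bit for this model */

/*
 * Parse one mount option token.  Both 'dax' and the legacy 'xip' spelling
 * enable the same flag; 'xip' additionally prints a warning.
 * Returns 1 if the option was recognised, 0 otherwise.
 */
static int parse_mount_option(const char *opt, unsigned *flags)
{
	assert(opt != NULL && flags != NULL);

	if (strcmp(opt, "dax") == 0) {
		*flags |= MODEL_MOUNT_DAX;
		return 1;
	}
	if (strcmp(opt, "xip") == 0) {
		/* Wording of the warning is made up for this sketch. */
		fprintf(stderr, "warning: 'xip' is deprecated, use 'dax'\n");
		*flags |= MODEL_MOUNT_DAX;
		return 1;
	}
	return 0;
}
```

Existing scripts that pass 'xip' keep working in this model, and the warning gives them a reason to switch; what /proc/mounts should then report is a separate question that the sketch does not cover.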
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 18/22] xip: Add xip_zero_page_range
Date: Thu, 10 Apr 2014 10:27:29 -0400
Message-ID: <20140410142729.GL5727@linux.intel.com>
References: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> <20140409101512.GL32103@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ross Zwisler
To: Jan Kara
Return-path:
Content-Disposition: inline
In-Reply-To: <20140409101512.GL32103@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, Apr 09, 2014 at 12:15:12PM +0200, Jan Kara wrote:
> > +	/*
> > +	 * ext4 sometimes asks to zero past the end of a block.  It
> > +	 * really just wants to zero to the end of the block.
> > +	 */
>   Then we should really fix ext4 I believe...

Since I didn't want to do this ...

> > +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
> > +#define dax_truncate_page(inode, from, get_block) \
> > +	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
>                                          ^^^^
> This should be (PAGE_CACHE_SIZE - (from & (PAGE_CACHE_SIZE - 1))), shouldn't it?

... I could get away without doing that ;-)
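Why the oversized length is harmless can be shown with a small model of the clamping that the quoted comment describes. This is a sketch only: MODEL_PAGE_CACHE_SIZE and clamp_to_page_end() are hypothetical, the model assumes block size equals page size, and how dax_zero_page_range() actually clamps is not shown in this thread.

```c
#include <assert.h>

#define MODEL_PAGE_CACHE_SIZE 4096ul	/* assumed page size for this model */

/*
 * Limit 'length' so that zeroing starting at 'from' stops at the end of
 * the page containing 'from'.
 */
static unsigned long clamp_to_page_end(unsigned long from,
				       unsigned long length)
{
	unsigned long room;

	/* The mask below only works for a power-of-two page size. */
	assert((MODEL_PAGE_CACHE_SIZE & (MODEL_PAGE_CACHE_SIZE - 1)) == 0);

	room = MODEL_PAGE_CACHE_SIZE - (from & (MODEL_PAGE_CACHE_SIZE - 1));
	return length < room ? length : room;
}
```

With this clamp in the callee, passing a full MODEL_PAGE_CACHE_SIZE from offset 5000 yields 3192 bytes, the same value the expression suggested in the review would have passed in directly.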
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Date: Thu, 10 Apr 2014 20:31:04 +0200 Message-ID: <20140410183104.GA8060@quack.suse.cz> References: <20140409094644.GD32103@quack.suse.cz> <20140410141630.GH5727@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140410141630.GH5727@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu 10-04-14 10:16:30, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:46:44AM +0200, Jan Kara wrote: > > Another day, some more review ;) Comments below. > > I'm really grateful for all this review! It's killing me, though ;-) Yeah, I know that feeling. :) > > > +int dax_clear_blocks(struct inode *inode, sector_t block, long size) > > > +{ > > > + struct block_device *bdev = inode->i_sb->s_bdev; > > > + const struct block_device_operations *ops = bdev->bd_disk->fops; > > > + sector_t sector = block << (inode->i_blkbits - 9); > > > + unsigned long pfn; > > > + > > > + might_sleep(); > > > + do { > > > + void *addr; > > > + long count = ops->direct_access(bdev, sector, &addr, &pfn, > > > + size); > > So do you assume blocksize == PAGE_SIZE here? If not, addr could be in > > the middle of the page AFAICT. > > You're right. Depending on how clear_page() is implemented, that > might go badly wrong. Of course, both ext2 & ext4 require block_size > == PAGE_SIZE right now, so anything else is by definition untested. > I've been trying to keep DAX free from that assumption, but obviously > haven't caught all the places. > > How does this look? That looks fine. 
Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb() Date: Thu, 10 Apr 2014 20:35:26 +0200 Message-ID: <20140410183526.GB8060@quack.suse.cz> References: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com> <20140409095254.GE32103@quack.suse.cz> <20140410142254.GI5727@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140410142254.GI5727@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu 10-04-14 10:22:54, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:52:54AM +0200, Jan Kara wrote: > > > - if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) { > > > + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { > > > ext2_msg(sb, KERN_WARNING, "warning: refusing change of " > > > "xip flag with busy inodes while remounting"); > > > - sbi->s_mount_opt &= ~EXT2_MOUNT_XIP; > > > - sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP; > > > + sbi->s_mount_opt ^= EXT2_MOUNT_XIP; > > Although this is correct, it was easier to see that the previous code is > > correct so I'd prefer if you kept it that way. > > Depends how you think about it. I think of foo ^= bar as 'toggle the > bar bit in foo'. So I read the code as 'If the mount bit is incorrect, > print an error and toggle the bit'. I think you're reading the old code > as 'If the new mount bit differs from the old mount bit, make sure the > new mount bit is the same as the old mount bit'. 
Yeah, since it's pretty obvious what the code should do, one can figure out it is correct relatively quickly. But it's something that wasn't obvious to me at the first sight. If you really prefer your way, I can live with that... Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 17/22] Get rid of most mentions of XIP in ext2 Date: Thu, 10 Apr 2014 20:40:10 +0200 Message-ID: <20140410184010.GC8060@quack.suse.cz> References: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> <20140409100435.GJ32103@quack.suse.cz> <20140410142625.GK5727@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140410142625.GK5727@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu 10-04-14 10:26:25, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 12:04:35PM +0200, Jan Kara wrote: > > On Sun 23-03-14 15:08:43, Matthew Wilcox wrote: > > > The only remaining usage is userspace's 'xip' option. > > Looks good. You can add: > > Reviewed-by: Jan Kara > > I've been thinking about this patch, and I'm not happy with it any more :-) > > I want to migrate people away from using 'xip' to 'dax' without breaking > anybody's scripts. So I'm thinking about adding a new 'dax' option and > having the 'xip' option print a warning and force-enable the 'dax' option. > That way people who might have scripts to look for 'xip' in /proc/mounts > won't break. Yeah, that sounds reasonable. Maybe we could even show only 'dax' in /proc/mounts since I somewhat doubt there are any users who care. 
But showing also 'xip' when used is easy enough so why not.

Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 18/22] xip: Add xip_zero_page_range
Date: Thu, 10 Apr 2014 20:43:46 +0200
Message-ID: <20140410184346.GD8060@quack.suse.cz>
References: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> <20140409101512.GL32103@quack.suse.cz> <20140410142729.GL5727@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ross Zwisler
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <20140410142729.GL5727@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Thu 10-04-14 10:27:29, Matthew Wilcox wrote:
> On Wed, Apr 09, 2014 at 12:15:12PM +0200, Jan Kara wrote:
> > > +        /*
> > > +         * ext4 sometimes asks to zero past the end of a block. It
> > > +         * really just wants to zero to the end of the block.
> > > +         */
> > Then we should really fix ext4 I believe...
>
> Since I didn't want to do this ...
>
> > > +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
> > > +#define dax_truncate_page(inode, from, get_block) \
> > > +        dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
> >                                             ^^^^
> > This should be (PAGE_CACHE_SIZE - (from & (PAGE_CACHE_SIZE - 1))), shouldn't it?
>
> ... I could get away without doing that ;-)

I understand but ultimately the API is cleaner if it doesn't allow size
past end of block. So IMHO we shouldn't introduce new places that call the
function like this and we should fix places that do it now (make it
WARN_ON_ONCE() and let ext4 guys do the work for you ;).
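The length expression suggested in the review, PAGE_CACHE_SIZE - (from & (PAGE_CACHE_SIZE - 1)), can be tried out in a small userspace sketch. This is an illustration under stated assumptions only: EX_PAGE_SIZE is fixed at 4096 here, and both helper names are hypothetical, not part of the DAX patches.

```c
#include <assert.h>

/* Assumed page size for this example; the kernel's PAGE_CACHE_SIZE is per-arch. */
#define EX_PAGE_SIZE 4096UL

/* Bytes from offset `from` up to the end of the page that contains it. */
static unsigned long bytes_to_page_end(unsigned long long from)
{
	return EX_PAGE_SIZE - (unsigned long)(from & (EX_PAGE_SIZE - 1));
}

/*
 * Clamp a requested zeroing length so it never runs past the end of the
 * page, which is the behaviour the quoted comment says ext4 really wants.
 */
static unsigned long clamp_to_page_end(unsigned long long from,
				       unsigned long length)
{
	unsigned long room = bytes_to_page_end(from);

	assert(length > 0);
	return length < room ? length : room;
}
```

For example, an offset of 5000 lies 904 bytes into the second page, so bytes_to_page_end(5000) is 3192 and a request to zero 4096 bytes from there is clamped to 3192. A caller that passes the exact length never needs the clamp, which is the cleaner API being argued for.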
Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Sun, 13 Apr 2014 07:21:32 -0400 Message-ID: <20140413112132.GP5727@linux.intel.com> References: <20140408220525.GC26019@quack.suse.cz> <20140409204806.GF5727@linux.intel.com> <20140409211203.GP32103@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140409211203.GP32103@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Apr 09, 2014 at 11:12:03PM +0200, Jan Kara wrote: > This would be fine except that unmap_mapping_range() grabs i_mmap_mutex > again :-|. But it might be easier to provide a version of that function > which assumes i_mmap_mutex is already locked than what I was suggesting. *sigh*. I knew that once ... which was why the call was after dropping the lock. 
OK, another try at fixing the problem; handle it down in the insert_pfn code: diff --git a/fs/dax.c b/fs/dax.c index 6a8725b..2453025 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -390,7 +390,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, error = dax_get_pfn(&bh, &pfn, blkbits); if (error > 0) - error = vm_insert_mixed(vma, vaddr, pfn); + error = vm_replace_mixed(vma, vaddr, pfn); mutex_unlock(&mapping->i_mmap_mutex); if (page) { diff --git a/include/linux/mm.h b/include/linux/mm.h index ba72c54..df25410 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1944,8 +1944,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr, int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn); +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, bool replace); +#define vm_insert_mixed(vma, addr, pfn) \ + __vm_insert_mixed(vma, addr, pfn, false) +#define vm_replace_mixed(vma, addr, pfn) \ + __vm_insert_mixed(vma, addr, pfn, true) int vm_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *, unsigned long pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); diff --git a/mm/memory.c b/mm/memory.c index 76fd657..ec59239 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2100,7 +2100,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, * pages reserved for the old functions anyway. 
*/ static int insert_page(struct vm_area_struct *vma, unsigned long addr, - struct page *page, pgprot_t prot) + struct page *page, pgprot_t prot, bool replace) { struct mm_struct *mm = vma->vm_mm; int retval; @@ -2116,8 +2116,12 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, if (!pte) goto out; retval = -EBUSY; - if (!pte_none(*pte)) - goto out_unlock; + if (!pte_none(*pte)) { + if (!replace) + goto out_unlock; + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); + zap_page_range_single(vma, addr, PAGE_SIZE, NULL); + } /* Ok, finally just insert the thing.. */ get_page(page); @@ -2173,12 +2177,12 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, BUG_ON(vma->vm_flags & VM_PFNMAP); vma->vm_flags |= VM_MIXEDMAP; } - return insert_page(vma, addr, page, vma->vm_page_prot); + return insert_page(vma, addr, page, vma->vm_page_prot, false); } EXPORT_SYMBOL(vm_insert_page); static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn, pgprot_t prot) + unsigned long pfn, pgprot_t prot, bool replace) { struct mm_struct *mm = vma->vm_mm; int retval; @@ -2190,8 +2194,12 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, if (!pte) goto out; retval = -EBUSY; - if (!pte_none(*pte)) - goto out_unlock; + if (!pte_none(*pte)) { + if (!replace) + goto out_unlock; + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); + zap_page_range_single(vma, addr, PAGE_SIZE, NULL); + } /* Ok, finally just insert the thing.. 
*/ entry = pte_mkspecial(pfn_pte(pfn, prot)); @@ -2244,14 +2252,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, if (track_pfn_insert(vma, &pgprot, pfn)) return -EINVAL; - ret = insert_pfn(vma, addr, pfn, pgprot); + ret = insert_pfn(vma, addr, pfn, pgprot, false); return ret; } EXPORT_SYMBOL(vm_insert_pfn); -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn) +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, + unsigned long pfn, bool replace) { BUG_ON(!(vma->vm_flags & VM_MIXEDMAP)); @@ -2269,11 +2277,11 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, struct page *page; page = pfn_to_page(pfn); - return insert_page(vma, addr, page, vma->vm_page_prot); + return insert_page(vma, addr, page, vma->vm_page_prot, replace); } - return insert_pfn(vma, addr, pfn, vma->vm_page_prot); + return insert_pfn(vma, addr, pfn, vma->vm_page_prot, replace); } -EXPORT_SYMBOL(vm_insert_mixed); +EXPORT_SYMBOL(__vm_insert_mixed); static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, unsigned long pfn, pgprot_t prot) > > > > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > > > + get_block_t get_block) > > > > +{ > > > > + int result; > > > > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > > > + > > > > + sb_start_pagefault(sb); > > > You don't need any filesystem freeze protection for the fault handler > > > since that's not going to modify the filesystem. > > > > Err ... we might allocate a block as a result of doing a write to a hole. > > Or does that not count as 'modifying the filesystem' in this context? > Ah, it does. But it would be nice to avoid doing sb_start_pagefault() if > it's not a write fault - because you don't want to block reading from a > frozen filesystem (imagine what would happen when you freeze your root > filesystem to do a snapshot...). 
> > I have somewhat a mindset of standard pagecache mmap where filemap_fault() > only reads in data regardless of FAULT_FLAG_WRITE setting so I was confused > by your difference :). Understood! So this should work: diff --git a/fs/dax.c b/fs/dax.c index 2453025..e4d00fc 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -431,10 +431,13 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, int result; struct super_block *sb = file_inode(vma->vm_file)->i_sb; - sb_start_pagefault(sb); - file_update_time(vma->vm_file); + if (vmf->flags & FAULT_FLAG_WRITE) { + sb_start_pagefault(sb); + file_update_time(vma->vm_file); + } result = do_dax_fault(vma, vmf, get_block); - sb_end_pagefault(sb); + if (vmf->flags & FAULT_FLAG_WRITE) + sb_end_pagefault(sb); return result; } @@ -453,15 +456,7 @@ EXPORT_SYMBOL_GPL(dax_fault); int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, get_block_t get_block) { - int result; - struct super_block *sb = file_inode(vma->vm_file)->i_sb; - - sb_start_pagefault(sb); - file_update_time(vma->vm_file); - result = do_dax_fault(vma, vmf, get_block); - sb_end_pagefault(sb); - - return result; + return dax_fault(vma, vmf, get_block); } EXPORT_SYMBOL_GPL(dax_mkwrite); > > > > + file_update_time(vma->vm_file); > > > Why do you update m/ctime? We are only reading the file... > > > > ... except that it might be a write fault. I think we modify the file > > iff we return VM_FAULT_MAJOR from do_dax_fault(). So I'd be open to > > something like this: > > > > sb_start_pagefault(sb); > > result = do_dax_fault(vma, vmf, get_block); > > if (result & VM_FAULT_MAJOR) > > file_update_time(vma->vm_file); > > sb_end_pagefault(sb); > > > > Would that work better for you? > Definitely. It's also a performance thing BTW - updating time stamps is > relatively expensive for journalling filesystems - you have to start a > transaction, add block with inode to the journal, stop a transaction - not > something you want to do unless you have to. 
I realised that this isn't right. If you do a store to an mmaped file, you should update the timestamps, whether or not the fs had to allocate blocks. Hence the version above that only checks whether the fault is for write or not. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Sun, 13 Apr 2014 14:03:29 -0400 Message-ID: <20140413180329.GR5727@linux.intel.com> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140409214331.GQ32103@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Apr 09, 2014 at 11:43:31PM +0200, Jan Kara wrote: > On Wed 09-04-14 16:51:11, Matthew Wilcox wrote: > > On Wed, Apr 09, 2014 at 12:27:58PM +0200, Jan Kara wrote: > > > > + if (unlikely(vmf->pgoff >= size)) { > > > > + mutex_unlock(&mapping->i_mmap_mutex); > > > > + goto sigbus; > > > You need to release the block you've got from the filesystem in case of > > > error here an below. > > > > What's the API to do that? Call inode->i_op->setattr()? > That's a great question. Yes, ->setattr() is the only API you have for > that but you cannot use that because of locking constraints (it needs > i_mutex and that's not possible to get in the fault path). Let me read > again what the handler does... 
>
> So there are three places that can fail after we allocate the block:
> 1) We race with truncate reducing i_size
> 2) dax_get_pfn() fails
> 3) vm_insert_mixed() fails
>
> I would guess that 2) can fail only if the HW has problems and leaking
> block in that case could be acceptable (please correct me if I'm wrong).
> 3) shouldn't fail because of ENOMEM because fault has already allocated all
> the page tables and EBUSY should be handled as well. So the only failure we
> have to care about is 1). And we could move ->get_block() call under
> i_mmap_mutex after the i_size check. Lock ordering should be fine because
> i_mmap_mutex ranks above page lock under which we do block mapping in
> standard ->page_mkwrite callbacks. The only (big) drawback is that
> i_mmap_mutex will now be held for much longer time and thus the contention
> would be much higher. But hopefully once we resolve our problems with
> mmap_sem and introduce mapping range lock we could scale reasonably.

I think you're right about the only failure case to worry about being (1).
For 2 or 3, we haven't *leaked* the block, we've merely allocated it, found
out we couldn't use it, and then not freed it. It'll be freed when the file
is deleted or truncated.

Taking the i_mmap_mutex earlier looks reasonable. I'll do that.

As far as reducing contention on i_mmap_mutex goes, I'm currently planning
on using an exceptional entry in the radix tree, designating one bit of
that as the lock bit and using the remaining 29 / 61 bits to cache the PFN.
That lock would then have the same rank as the page lock.

It might be interesting to build that kind of 'locking' into the radix tree
... I'm half-thinking about taking a lock higher in the radix tree to cover
large pages. I'll probably just use the lock bit in the entry that would
cover the head page, though.
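The "one lock bit plus 29 / 61 bits of PFN" idea above can be sketched in userspace C. Everything here is an assumption for illustration: the layout (two low tag bits, then one lock bit, then the PFN) and all ex_* names are hypothetical, and the helpers are plain bit arithmetic with none of the atomicity a real lock would need.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, low to high: 2 tag bits, 1 lock bit, then the PFN. */
#define EX_TAG_BITS    2
#define EX_TAG_EXCEPT  ((uintptr_t)2)                 /* marks an exceptional entry */
#define EX_LOCK_BIT    ((uintptr_t)1 << EX_TAG_BITS)  /* the designated lock bit */
#define EX_PFN_SHIFT   (EX_TAG_BITS + 1)
/* 29 bits on a 32-bit word, 61 bits on a 64-bit word. */
#define EX_PFN_BITS    (sizeof(uintptr_t) * 8 - EX_PFN_SHIFT)

/* Build an unlocked entry that caches `pfn`. */
static uintptr_t ex_make_entry(uintptr_t pfn)
{
	/* The PFN must survive the shift, i.e. fit in EX_PFN_BITS bits. */
	assert(((pfn << EX_PFN_SHIFT) >> EX_PFN_SHIFT) == pfn);
	return (pfn << EX_PFN_SHIFT) | EX_TAG_EXCEPT;
}

static uintptr_t ex_entry_pfn(uintptr_t entry)
{
	return entry >> EX_PFN_SHIFT;
}

static int ex_entry_locked(uintptr_t entry)
{
	return (entry & EX_LOCK_BIT) != 0;
}

/* Non-atomic: a real implementation would need an atomic bit operation. */
static uintptr_t ex_entry_lock(uintptr_t entry)
{
	return entry | EX_LOCK_BIT;
}

static uintptr_t ex_entry_unlock(uintptr_t entry)
{
	return entry & ~EX_LOCK_BIT;
}
```

Setting or clearing the lock bit never disturbs the cached PFN or the tag bits, so a lookup can read the PFN from the entry whether or not it is currently locked.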
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Date: Sun, 13 Apr 2014 14:05:52 -0400 Message-ID: <20140413180552.GS5727@linux.intel.com> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> <20140408202102.GB5727@linux.intel.com> <20140409091450.GA32103@quack.suse.cz> <20140409151908.GD5727@linux.intel.com> <20140409205529.GO32103@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140409205529.GO32103@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Apr 09, 2014 at 10:55:29PM +0200, Jan Kara wrote: > > In addition to writing back dirty pages, filemap_write_and_wait_range() > > will evict clean pages. Unintuitive, I know, but it matches what the > > direct I/O path does. Plus, if we fall back to buffered I/O for holes > > (see above), then this will do the right thing at that time. > Ugh, I'm pretty certain filemap_write_and_wait_range() doesn't evict > anything ;). Direct IO path calls that function so that direct IO read > after buffered write returns the written data. In that case we don't evict > anything from page cache because direct IO read doesn't invalidate any > information we have cached. Only direct IO write does that and for that we > call invalidate_inode_pages2_range() after writing the pages. So I maintain > that what you do doesn't make sense to me. You might need to do some > invalidation of hole pages. 
But note that generic_file_direct_write() does > that for you and even though that isn't serialized in any way with page > faults which can instantiate the hole pages again, things should work out > fine for you since that function also invalidates the range again after > ->direct_IO callback is done. So AFAICT you don't have to do anything > except writing some nice comment about this ;). You're right. I'm not sure what I got confused with there. I don't think there's a race I need to worry about ... even if another page gets instantiated (consider one thread furiously loading from a hole as fast as it can while another thread does a write), we'll shoot it down again. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page Date: Sun, 13 Apr 2014 15:07:21 -0400 Message-ID: <20140413190721.GA21460@linux.intel.com> References: <20140408221759.GD26019@quack.suse.cz> <20140409092635.GB32103@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140409092635.GB32103@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Apr 09, 2014 at 11:26:35AM +0200, Jan Kara wrote: > I thought about this for a while and classical IO, truncation etc. could > easily work for blocksize < pagesize. And for mmap() you could just use > pagecache. Not sure if it's worth the complications though. Anyway we > should decide whether we don't care about blocksize < PAGE_CACHE_SIZE at > all, or whether we try to make things which can work reasonably easily > functional. 
In that case dax_truncate_page() needs some tweaking because it > currently assumes blocksize == PAGE_CACHE_SIZE. I think it actually assumes that blocksize <= PAGE_CACHE_SIZE in that it doesn't contain a loop to iterate over all blocks. It wouldn't be hard to fix but I'll just put in a comment noting what needs to be fixed ... I don't think there's going to be a lot of enthusiasm for adding support for blocksize != PAGE_SIZE / PAGE_CACHE_SIZE. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Mon, 14 Apr 2014 18:04:57 +0200 Message-ID: <20140414160457.GB13860@quack.suse.cz> References: <20140408220525.GC26019@quack.suse.cz> <20140409204806.GF5727@linux.intel.com> <20140409211203.GP32103@quack.suse.cz> <20140413112132.GP5727@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140413112132.GP5727@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun 13-04-14 07:21:32, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:12:03PM +0200, Jan Kara wrote: > > This would be fine except that unmap_mapping_range() grabs i_mmap_mutex > > again :-|. But it might be easier to provide a version of that function > > which assumes i_mmap_mutex is already locked than what I was suggesting. > > *sigh*. I knew that once ... which was why the call was after dropping > the lock. 
OK, another try at fixing the problem; handle it down in the > insert_pfn code: OK, that change looks OK to me (although you might want to introduce vm_replace_mixed() in a separate patch). > > > > > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > > > > + get_block_t get_block) > > > > > +{ > > > > > + int result; > > > > > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > > > > + > > > > > + sb_start_pagefault(sb); > > > > You don't need any filesystem freeze protection for the fault handler > > > > since that's not going to modify the filesystem. > > > > > > Err ... we might allocate a block as a result of doing a write to a hole. > > > Or does that not count as 'modifying the filesystem' in this context? > > Ah, it does. But it would be nice to avoid doing sb_start_pagefault() if > > it's not a write fault - because you don't want to block reading from a > > frozen filesystem (imagine what would happen when you freeze your root > > filesystem to do a snapshot...). > > > > I have somewhat a mindset of standard pagecache mmap where filemap_fault() > > only reads in data regardless of FAULT_FLAG_WRITE setting so I was confused > > by your difference :). > > Understood! So this should work: > > diff --git a/fs/dax.c b/fs/dax.c > index 2453025..e4d00fc 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -431,10 +431,13 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > int result; > struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > - sb_start_pagefault(sb); > - file_update_time(vma->vm_file); > + if (vmf->flags & FAULT_FLAG_WRITE) { > + sb_start_pagefault(sb); > + file_update_time(vma->vm_file); > + } Yup, this looks good to me. Later if we find file_update_time() is slowing down faults too much, we can defer the actual update to msync() / close() time (POSIX actually allows that). But that's definitely for future. 
> result = do_dax_fault(vma, vmf, get_block); > - sb_end_pagefault(sb); > + if (vmf->flags & FAULT_FLAG_WRITE) > + sb_end_pagefault(sb); > > return result; > } Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Boaz Harrosh Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs Date: Sun, 18 May 2014 17:58:16 +0300 Message-ID: <5378CA88.3080105@gmail.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, willy@linux.intel.com To: Matthew Wilcox , linux-kernel@vger.kernel.org, Sagi Manole Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 03/23/2014 09:08 PM, Matthew Wilcox wrote: > One of the primary uses for NV-DIMMs is to expose them as a block device > and use a filesystem to store files on the NV-DIMM. While that works, > it currently wastes memory and CPU time buffering the files in the page > cache. We have support in ext2 for bypassing the page cache, but it > has some races which are unfixable in the current design. This series > of patches rewrite the underlying support, and add support for direct > access to ext4. > > This iteration of the patchset rebases to Linus' 3.14-rc7 (plus Kirill's > patches in linux-next http://marc.info/?l=linux-mm&m=139206489208546&w=2) Hi Matthew We are experimenting with NV-DIMMs. The experiment will use its own FS not based on ext4 at all, more like the infamous PMFS but we want to start DAX based and not current XIP based. We want to make sure the proposed new API can be utilized stand alone and there are no extX based assumptions. 
(Like the need for direct directory access instead of the ext4
copy-from-nvdimm-to-ram directory)

Could you please put these patches on a public tree somewhere, or perhaps some
later version, that I can pull directly from? This would help a lot. These
patches are a bit hard to apply because it is not clear which of Kirill's
patches I need. I tried some linux-next version around 3.14-rc7 that also
includes Kirill's patches, but it looks like there was further work done than
your base. I was able to produce a tree with V6 of your patches but I would
hate to do that manual work yet again. (Any Linux base is fine, just so that I
can pull it.)

Thanks

Also I'm curious. I see you guys were working on PMFS for a while
fixing and enhancing stuff. Then development stopped and these DAX
patches started showing. Now, PMFS is based on current XIP (I was able
to easily port it to 3.14-rc7). Do you guys have an internal attempt
to port PMFS to DAX? (We might do it in future just as an exercise
to get intimate with DAX and to make sure nothing is missing.)
What are your plans with PMFS? Is it dead?

Good day
Boaz

> and fixes several bugs:
>
> - Initialise cow_page in do_page_mkwrite() (Matthew Wilcox)
> - Clear new or unwritten blocks in page fault handler (Matthew Wilcox)
> - Only call get_block when necessary (Matthew Wilcox)
> - Reword Kconfig options (Matthew Wilcox / Vishal Verma)
> - Fix a race between page fault and truncate (Matthew Wilcox)
> - Fix a race between fault-for-read and fault-for-write (Matthew Wilcox)
> - Zero the correct bytes in dax_new_buf() (Toshi Kani)
> - Add DIO_LOCKING to an invocation of dax_do_io in ext4 (Ross Zwisler)
>
> Relative to the last patchset, I folded the 'Add reporting of major faults'
> patch into the patch that adds the DAX page fault handler.
>
> The v6 patchset had seven additional xfstests failures. This patchset
> now passes approximately as many xfstests as ext4 does on a ramdisk.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs
Date: Sun, 18 May 2014 19:24:03 -0400
Message-ID: <20140518232403.GF6121@linux.intel.com>
References: <5378CA88.3080105@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox , linux-kernel@vger.kernel.org, Sagi Manole , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
To: Boaz Harrosh
Return-path:
Content-Disposition: inline
In-Reply-To: <5378CA88.3080105@gmail.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Sun, May 18, 2014 at 05:58:16PM +0300, Boaz Harrosh wrote:
> We are experimenting with NV-DIMMs. The experiment will use its own
> FS not based on ext4 at all, more like the infamous PMFS but we want
> to start DAX based and not current XIP based. We want to make sure the proposed
> new API can be utilized stand alone and there are no extX based assumptions.
> (Like the need for direct directory access instead of the ext4
> copy-from-nvdimm-to-ram directory)

Hi Boaz,

Best of luck with your new filesystem.

> Could you please put these patches on a public tree somewhere, or perhaps some
> later version, that I can pull directly from? This would help a lot.

I'm preparing a v8 right now; it will probably be available by the end of the
week.

> Also I'm curious. I see you guys were working on PMFS for a while
> fixing and enhancing stuff. Then development stopped and these DAX
> patches started showing. Now, PMFS is based on current XIP (I was able
> to easily port it to 3.14-rc7). Do you guys have an internal attempt
> to port PMFS to DAX? (We might do it in future just as an exercise
> to get intimate with DAX and to make sure nothing is missing.)
> What are your plans with PMFS? Is it dead?

My group has no plans to do any more work with PMFS, and I'm not aware of anyone else planning on turning PMFS into a production-quality filesystem. But the code is out there and we can't stop anybody else from working on it. PMFS uses neither DAX nor XIP; it doesn't sit on top of a block device. We would probably have moved it to sit on top of a block device by now had we been developing it further. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Toshi Kani Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Wed, 21 May 2014 14:35:07 -0600 Message-ID: <1400704507.18128.23.camel@misato.fc.hp.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Sun, 2014-03-23 at 15:08 -0400, Matthew Wilcox wrote: : > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ : > + error = dax_get_pfn(inode, &bh, &pfn); > + if (error > 0) > + error = vm_insert_mixed(vma, vaddr, pfn); > + mutex_unlock(&mapping->i_mmap_mutex); > + > + if (page) { > + delete_from_page_cache(page); > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > + PAGE_CACHE_SIZE, 0); > + unlock_page(page); > + page_cache_release(page); Hi Matthew, I am seeing a problem in this code path, where it deletes a page cache page mapped to a hole. Sometimes, page->_mapcount is 0, not -1, which leads __delete_from_page_cache(), called from delete_from_page_cache(), to hit the following BUG_ON. BUG_ON(page_mapped(page)) I suppose such page has a shared mapping. 
Does this code need to take care of replacing shared mappings in such case? Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Toshi Kani Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Thu, 05 Jun 2014 16:38:34 -0600 Message-ID: <1402007914.7963.8.camel@misato.fc.hp.com> References: <1400704507.18128.23.camel@misato.fc.hp.com> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com To: Matthew Wilcox Return-path: In-Reply-To: <1400704507.18128.23.camel@misato.fc.hp.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, 2014-05-21 at 14:35 -0600, Toshi Kani wrote: > On Sun, 2014-03-23 at 15:08 -0400, Matthew Wilcox wrote: > : > > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > + get_block_t get_block) > > +{ > : > > + error = dax_get_pfn(inode, &bh, &pfn); > > + if (error > 0) > > + error = vm_insert_mixed(vma, vaddr, pfn); > > + mutex_unlock(&mapping->i_mmap_mutex); > > + > > + if (page) { > > + delete_from_page_cache(page); > > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > > + PAGE_CACHE_SIZE, 0); > > + unlock_page(page); > > + page_cache_release(page); > > Hi Matthew, > > I am seeing a problem in this code path, where it deletes a page cache > page mapped to a hole. Sometimes, page->_mapcount is 0, not -1, which > leads __delete_from_page_cache(), called from delete_from_page_cache(), > to hit the following BUG_ON. > > BUG_ON(page_mapped(page)) > > I suppose such page has a shared mapping. Does this code need to take > care of replacing shared mappings in such case? 
Hi Matthew, The following change works in my environment. What do you think? Thanks, -Toshi --- fs/dax.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/dax.c b/fs/dax.c index 2d6b4bc..046c6d6 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -26,6 +26,7 @@ #include #include #include +#include int dax_clear_blocks(struct inode *inode, sector_t block, long size) { @@ -385,6 +386,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, mutex_unlock(&mapping->i_mmap_mutex); if (page) { + if (page_mapped(page)) + try_to_unmap(page, TTU_UNMAP|TTU_IGNORE_ACCESS); delete_from_page_cache(page); unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, PAGE_CACHE_SIZE, 0); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Boaz Harrosh Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs Date: Tue, 17 Jun 2014 21:11:47 +0300 Message-ID: <53A084E3.6080103@gmail.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: willy@linux.intel.com To: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 03/23/2014 09:08 PM, Matthew Wilcox wrote: > One of the primary uses for NV-DIMMs is to expose them as a block device > and use a filesystem to store files on the NV-DIMM. While that works, > it currently wastes memory and CPU time buffering the files in the page > cache. We have support in ext2 for bypassing the page cache, but it > has some races which are unfixable in the current design. This series > of patches rewrite the underlying support, and add support for direct > access to ext4. 
> > This iteration of the patchset rebases to Linus' 3.14-rc7 (plus Kirill's > patches in linux-next http://marc.info/?l=linux-mm&m=139206489208546&w=2) > and fixes several bugs: > > - Initialise cow_page in do_page_mkwrite() (Matthew Wilcox) > - Clear new or unwritten blocks in page fault handler (Matthew Wilcox) > - Only call get_block when necessary (Matthew Wilcox) > - Reword Kconfig options (Matthew Wilcox / Vishal Verma) > - Fix a race between page fault and truncate (Matthew Wilcox) > - Fix a race between fault-for-read and fault-for-write (Matthew Wilcox) > - Zero the correct bytes in dax_new_buf() (Toshi Kani) > - Add DIO_LOCKING to an invocation of dax_do_io in ext4 (Ross Zwisler) > > Relative to the last patchset, I folded the 'Add reporting of major faults' > patch into the patch that adds the DAX page fault handler. > > The v6 patchset had seven additional xfstests failures. This patchset > now passes approximately as many xfstests as ext4 does on a ramdisk. > > Matthew Wilcox (21): > Fix XIP fault vs truncate race > Allow page fault handlers to perform the COW > axonram: Fix bug in direct_access > Change direct_access calling convention > Introduce IS_DAX(inode) > Replace XIP read and write with DAX I/O > Replace the XIP page fault handler with the DAX page fault handler > Replace xip_truncate_page with dax_truncate_page > Remove mm/filemap_xip.c > Remove get_xip_mem > Replace ext2_clear_xip_target with dax_clear_blocks > ext2: Remove ext2_xip_verify_sb() > ext2: Remove ext2_use_xip > ext2: Remove xip.c and xip.h > Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX > ext2: Remove ext2_aops_xip > Get rid of most mentions of XIP in ext2 > xip: Add xip_zero_page_range > ext4: Make ext4_block_zero_page_range static > ext4: Fix typos > brd: Rename XIP to DAX Hi Matthew I have some more trouble with DAX (and old XIP) please forgive me if I'm just senile and clueless. And put some sense into me. 
The title of this patchset is "ext4 on NV-DIMMs" But all I see is that DAX (and old XIP) is supported by mounting over brd devices. (On x86 I'm not sure about the other drivers) But looking to use brd with real NV_DIMMS fails miserably. (I'm talking about the RAM based NV_DIMMS (backed by flash) and not about the block based Diablo DDR bus flash devices type) Looking at the brd code I fail to see how it will ever support NV_DIMMS. brd is "struct page" based and shares RAM from the same memory pool as the rest of the system. But NV_DIMMS is not page-based and is excluded from the memory system. It needs to be exclusively owned by a device and the mounted FS. We currently have in our lab the old DDR3 based NV_DIMMS and on regular boot it appears as RAM. We need to use memmap= option on command line of Kernel to exclude it from use by Kernel. We have received our DDR4 based NV_DIMMS but still waiting for the actual system board to support it. As I understand from STD documentation these devices will not identify as RAM and will be exported as ACPI or SBUS devices that can be queried for sizes and address as well as properties about the chips. So I imagine a udev rule will need to probe the right driver to mount over those. So currently from what I can see only the infamous PMFS is the setup that can actually mount/support my NV_DIMMS today. It seems to me like we need a *new* block device that receives, like PMFS, an physical_address + size on load and will export this raw region as a block device. Of course with support of new DAX API. Should I send in such a device code. (I've seen the linux-nvdimm project on github but did not see how my above problem is addressed, it looks geared for that other type DDR bus devices) So please how is all that suppose to work, what is the strategy stack for all this? I guess for now I'm stuck with PMFS. 
(BTW: A public git tree of DAX patches ;-) ) Thanks Boaz > > Ross Zwisler (1): > ext4: Add DAX functionality > > Documentation/filesystems/Locking | 3 - > Documentation/filesystems/dax.txt | 84 ++++++ > Documentation/filesystems/ext4.txt | 2 + > Documentation/filesystems/xip.txt | 68 ----- > arch/powerpc/sysdev/axonram.c | 8 +- > drivers/block/Kconfig | 13 +- > drivers/block/brd.c | 22 +- > drivers/s390/block/dcssblk.c | 19 +- > fs/Kconfig | 21 +- > fs/Makefile | 1 + > fs/dax.c | 509 +++++++++++++++++++++++++++++++++++++ > fs/exofs/inode.c | 1 - > fs/ext2/Kconfig | 11 - > fs/ext2/Makefile | 1 - > fs/ext2/ext2.h | 9 +- > fs/ext2/file.c | 45 +++- > fs/ext2/inode.c | 37 +-- > fs/ext2/namei.c | 13 +- > fs/ext2/super.c | 48 ++-- > fs/ext2/xip.c | 91 ------- > fs/ext2/xip.h | 26 -- > fs/ext4/ext4.h | 8 +- > fs/ext4/file.c | 53 +++- > fs/ext4/indirect.c | 19 +- > fs/ext4/inode.c | 94 ++++--- > fs/ext4/namei.c | 10 +- > fs/ext4/super.c | 39 ++- > fs/open.c | 5 +- > include/linux/blkdev.h | 4 +- > include/linux/fs.h | 49 +++- > include/linux/mm.h | 2 + > mm/Makefile | 1 - > mm/fadvise.c | 6 +- > mm/filemap.c | 6 +- > mm/filemap_xip.c | 483 ----------------------------------- > mm/madvise.c | 2 +- > mm/memory.c | 45 +++- > 37 files changed, 984 insertions(+), 874 deletions(-) > create mode 100644 Documentation/filesystems/dax.txt > delete mode 100644 Documentation/filesystems/xip.txt > create mode 100644 fs/dax.c > delete mode 100644 fs/ext2/xip.c > delete mode 100644 fs/ext2/xip.h > delete mode 100644 mm/filemap_xip.c > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs Date: Tue, 17 Jun 2014 14:19:25 -0400 Message-ID: <20140617181925.GF12025@linux.intel.com> References: <53A084E3.6080103@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Boaz Harrosh Return-path: Content-Disposition: inline In-Reply-To: <53A084E3.6080103@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org On Tue, Jun 17, 2014 at 09:11:47PM +0300, Boaz Harrosh wrote: > Looking at the brd code I fail to see how it will ever support NV_DIMMS. > brd is "struct page" based and shares RAM from the same memory pool as the rest > of the system. But NV_DIMMS is not page-based and is excluded from the > memory system. It needs to be exclusively owned by a device and the mounted > FS. > > We currently have in our lab the old DDR3 based NV_DIMMS and on regular boot > it appears as RAM. We need to use memmap= option on command line of Kernel > to exclude it from use by Kernel. > > We have received our DDR4 based NV_DIMMS but still waiting for the actual > system board to support it. As I understand from STD documentation > these devices will not identify as RAM and will be exported as ACPI or > SBUS devices that can be queried for sizes and address as well as properties > about the chips. So I imagine a udev rule will need to probe the right driver > to mount over those. > > So currently from what I can see only the infamous PMFS is the setup that > can actually mount/support my NV_DIMMS today. > > It seems to me like we need a *new* block device that receives, like PMFS, > an physical_address + size on load and will export this raw region as a block > device. Of course with support of new DAX API. Should I send in such a device > code. 
> > (I've seen the linux-nvdimm project on github but did not see how my above > problem is addressed, it looks geared for that other type DDR bus devices) > > So please how is all that suppose to work, what is the strategy stack > for all this? I guess for now I'm stuck with PMFS. > > (BTW: A public git tree of DAX patches ;-) ) https://github.com/01org/prd should sort you out with both a git tree and a new block driver. You'll need to tell it manually what address range to use. I'm using it against regular DIMMs, and this works pretty well for me since my BIOS doesn't zero DRAM on reset. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Boaz Harrosh Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs Date: Tue, 17 Jun 2014 21:39:03 +0300 Message-ID: <53A08B47.3010701@gmail.com> References: <53A084E3.6080103@gmail.com> <20140617181925.GF12025@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: In-Reply-To: <20140617181925.GF12025@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 06/17/2014 09:19 PM, Matthew Wilcox wrote: > On Tue, Jun 17, 2014 at 09:11:47PM +0300, Boaz Harrosh wrote: > > https://github.com/01org/prd should sort you out with both a git tree > and a new block driver. You'll need to tell it manually what address > range to use. I'm using it against regular DIMMs, and this works pretty > well for me since my BIOS doesn't zero DRAM on reset. > God Yes exactly my missing link, Thanks. How I failed to find it? Yes for us too, BIOS doesn't zero DRAM and we can use it with using memmap= on kernel boot. Please include above link in new patchset and Documentation. Just to make the overall picture clearer. BTW what prevents from submitting this prd driver upstream right now? there are devices out there that will need it no? 
Even for something simple and very smart as putting my ext4 or xfs journal device on nv-dimm, no? The "manually address range to use" is fine in my book. A user-mode udev rule can then be used to cover the gap from sbus or acpi to prd. Hey actually this tree has everything I need. thanks man Boaz From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Tue, 29 Jul 2014 08:12:59 -0400 Message-ID: <20140729121259.GL6754@linux.intel.com> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140409214331.GQ32103@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Apr 09, 2014 at 11:43:31PM +0200, Jan Kara wrote: > So there are three places that can fail after we allocate the block: > 1) We race with truncate reducing i_size > 2) dax_get_pfn() fails > 3) vm_insert_mixed() fails > > I would guess that 2) can fail only if the HW has problems and leaking > block in that case could be acceptable (please correct me if I'm wrong). > 3) shouldn't fail because of ENOMEM because fault has already allocated all > the page tables and EBUSY should be handled as well. So the only failure we > have to care about is 1). And we could move ->get_block() call under > i_mmap_mutex after the i_size check. Lock ordering should be fine because > i_mmap_mutex ranks above page lock under which we do block mapping in > standard ->page_mkwrite callbacks.
The only (big) drawback is that > i_mmap_mutex will now be held for much longer time and thus the contention > would be much higher. But hopefully once we resolve our problems with > mmap_sem and introduce mapping range lock we could scale reasonably. Lockdep barfs on holding i_mmap_mutex while calling ext4's ->get_block. Path 1: ext4_fallocate -> ext4_punch_hole -> ext4_inode_attach_jinode() -> ... -> lock_map_acquire(&handle->h_lockdep_map); truncate_pagecache_range() -> unmap_mapping_range() -> mutex_lock(&mapping->i_mmap_mutex); Path 2: do_dax_fault() -> mutex_lock(&mapping->i_mmap_mutex); ext4_get_block() -> ... -> lock_map_acquire(&handle->h_lockdep_map); So that idea doesn't work. We can't exclude truncates by incrementing i_dio_count, because we can't take i_mutex in the fault path. I'm stumped. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Tue, 29 Jul 2014 23:04:57 +0200 Message-ID: <20140729210457.GA17807@quack.suse.cz> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140729121259.GL6754@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue 29-07-14 08:12:59, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:43:31PM +0200, Jan Kara wrote: > > So there are three places that can fail after we allocate the block: > > 1) We race with truncate reducing i_size > >
2) dax_get_pfn() fails > > 3) vm_insert_mixed() fails > > > > I would guess that 2) can fail only if the HW has problems and leaking > > block in that case could be acceptable (please correct me if I'm wrong). > > 3) shouldn't fail because of ENOMEM because fault has already allocated all > > the page tables and EBUSY should be handled as well. So the only failure we > > have to care about is 1). And we could move ->get_block() call under > > i_mmap_mutex after the i_size check. Lock ordering should be fine because > > i_mmap_mutex ranks above page lock under which we do block mapping in > > standard ->page_mkwrite callbacks. The only (big) drawback is that > > i_mmap_mutex will now be held for much longer time and thus the contention > > would be much higher. But hopefully once we resolve our problems with > > mmap_sem and introduce mapping range lock we could scale reasonably. > > Lockdep barfs on holding i_mmap_mutex while calling ext4's ->get_block. > > Path 1: > > ext4_fallocate -> > ext4_punch_hole -> > ext4_inode_attach_jinode() -> ... -> > lock_map_acquire(&handle->h_lockdep_map); > truncate_pagecache_range() -> > unmap_mapping_range() -> > mutex_lock(&mapping->i_mmap_mutex); This is strange. I don't see how ext4_inode_attach_jinode() can ever lead to lock_map_acquire(&handle->h_lockdep_map). Can you post a full trace for this? > Path 2: > do_dax_fault() -> > mutex_lock(&mapping->i_mmap_mutex); > ext4_get_block() -> ... -> > lock_map_acquire(&handle->h_lockdep_map); This is obviously correct. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Tue, 29 Jul 2014 17:23:33 -0400 Message-ID: <20140729212333.GO6754@linux.intel.com> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Received: from mga02.intel.com ([134.134.136.20]:26022 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752073AbaG2VXi (ORCPT ); Tue, 29 Jul 2014 17:23:38 -0400 Content-Disposition: inline In-Reply-To: <20140729210457.GA17807@quack.suse.cz> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Tue, Jul 29, 2014 at 11:04:57PM +0200, Jan Kara wrote: > > Path 1: > > > > ext4_fallocate -> > > ext4_punch_hole -> > > ext4_inode_attach_jinode() -> ... -> > > lock_map_acquire(&handle->h_lockdep_map); > > truncate_pagecache_range() -> > > unmap_mapping_range() -> > > mutex_lock(&mapping->i_mmap_mutex); > This is strange. I don't see how ext4_inode_attach_jinode() can ever lead > to lock_map_acquire(&handle->h_lockdep_map). Can you post a full trace for > this? 
Unfortunately, lockdep finds the inversion in the other order, so I have the backtraces of this path hitting the i_mmap_mutex while already holding jbd_mutex: ====================================================== [ INFO: possible circular locking dependency detected ] 3.16.0-rc6+ #91 Tainted: G W ------------------------------------------------------- fstest/31836 is trying to acquire lock: (jbd2_handle){+.+.+.}, at: [] start_this_handle+0x193/0x630 [jbd2] but task is already holding lock: (&mapping->i_mmap_mutex){+.+...}, at: [] do_dax_fault+0x4e0/0x640 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mapping->i_mmap_mutex){+.+...}: [] lock_acquire+0xb2/0x1f0 [] mutex_lock_nested+0x75/0x420 [] unmap_mapping_range+0x6b/0x180 [] truncate_pagecache_range+0x4a/0x60 [] ext4_punch_hole+0x4d1/0x530 [ext4] [] ext4_fallocate+0x156/0xb70 [ext4] [] do_fallocate+0x119/0x1b0 [] SyS_fallocate+0x43/0x70 [] system_call_fastpath+0x16/0x1b -> #0 (jbd2_handle){+.+.+.}: [] __lock_acquire+0x1d01/0x1eb0 [] lock_acquire+0xb2/0x1f0 [] start_this_handle+0x1ee/0x630 [jbd2] [] jbd2__journal_start+0xd4/0x260 [jbd2] [] __ext4_journal_start_sb+0x6d/0x190 [ext4] [] _ext4_get_block+0x16a/0x1c0 [ext4] [] ext4_get_block+0x16/0x20 [ext4] [] do_dax_fault+0x5d9/0x640 [] dax_fault+0x3f/0x90 [] ext4_dax_fault+0x15/0x20 [ext4] [] __do_fault+0x41/0xd0 [] do_shared_fault.isra.56+0x35/0x220 [] handle_mm_fault+0x303/0xf70 [] __do_page_fault+0x1ec/0x5b0 [] do_page_fault+0x22/0x30 [] page_fault+0x28/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&mapping->i_mmap_mutex); lock(jbd2_handle); lock(&mapping->i_mmap_mutex); lock(jbd2_handle); *** DEADLOCK *** 3 locks held by fstest/31836: #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x182/0x5b0 #1: (sb_pagefaults){++++..}, at: [] dax_fault+0x7a/0x90 #2: (&mapping->i_mmap_mutex){+.+...}, at: [] do_dax_fault+0x4e0/0x640 stack backtrace: CPU: 6 
PID: 31836 Comm: fstest Tainted: G W 3.16.0-rc6+ #91 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013 ffffffff825e63e0 ffff8800a0fc78c0 ffffffff815c6bc3 ffffffff825e63e0 ffff8800a0fc7900 ffffffff815c4e59 ffff8800a0fc7970 ffff8800a88f4a50 ffff8800a88f4af8 ffff8800a88f5280 0000000000000003 ffff8800a88f5248 Call Trace: [] dump_stack+0x4d/0x66 [] print_circular_bug+0x201/0x20f [] __lock_acquire+0x1d01/0x1eb0 [] ? cyc2ns_read_end+0x20/0x20 [] lock_acquire+0xb2/0x1f0 [] ? start_this_handle+0x193/0x630 [jbd2] [] start_this_handle+0x1ee/0x630 [jbd2] [] ? start_this_handle+0x193/0x630 [jbd2] [] ? new_handle+0x20/0x60 [jbd2] [] jbd2__journal_start+0xd4/0x260 [jbd2] [] ? _ext4_get_block+0x16a/0x1c0 [ext4] [] __ext4_journal_start_sb+0x6d/0x190 [ext4] [] _ext4_get_block+0x16a/0x1c0 [ext4] [] ext4_get_block+0x16/0x20 [ext4] [] do_dax_fault+0x5d9/0x640 [] ? _ext4_get_block+0x1c0/0x1c0 [ext4] [] ? _ext4_get_block+0x1c0/0x1c0 [ext4] [] dax_fault+0x3f/0x90 [] ext4_dax_fault+0x15/0x20 [ext4] [] __do_fault+0x41/0xd0 [] do_shared_fault.isra.56+0x35/0x220 [] handle_mm_fault+0x303/0xf70 [] ? __lock_is_held+0x56/0x80 [] __do_page_fault+0x1ec/0x5b0 [] ? vm_mmap_pgoff+0x9c/0xc0 [] ? up_write+0x1f/0x40 [] ? vm_mmap_pgoff+0x9c/0xc0 [] ? 
trace_hardirqs_off_thunk+0x3a/0x3c [] do_page_fault+0x22/0x30 [] page_fault+0x28/0x30 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Wed, 30 Jul 2014 11:52:29 +0200 Message-ID: <20140730095229.GA19205@quack.suse.cz> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="EeQfGwPcQSOJBaQU" Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20140729212333.GO6754@linux.intel.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org --EeQfGwPcQSOJBaQU Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Tue 29-07-14 17:23:33, Matthew Wilcox wrote: > On Tue, Jul 29, 2014 at 11:04:57PM +0200, Jan Kara wrote: > > > Path 1: > > > > > > ext4_fallocate -> > > > ext4_punch_hole -> > > > ext4_inode_attach_jinode() -> ... -> > > > lock_map_acquire(&handle->h_lockdep_map); > > > truncate_pagecache_range() -> > > > unmap_mapping_range() -> > > > mutex_lock(&mapping->i_mmap_mutex); > > This is strange. I don't see how ext4_inode_attach_jinode() can ever lead > > to lock_map_acquire(&handle->h_lockdep_map). Can you post a full trace for > > this? > > Unfortunately, lockdep finds the inversion in the other order, so I > have the backtraces of this path hitting the i_mmap_mutex while already > holding jbd_mutex: I see the problem now. How about an attached patch? Do you see other lockdep warnings with it? 
Honza > > ====================================================== > [ INFO: possible circular locking dependency detected ] > 3.16.0-rc6+ #91 Tainted: G W > ------------------------------------------------------- > fstest/31836 is trying to acquire lock: > (jbd2_handle){+.+.+.}, at: [] start_this_handle+0x193/0x630 [jbd2] > > but task is already holding lock: > (&mapping->i_mmap_mutex){+.+...}, at: [] do_dax_fault+0x4e0/0x640 > > which lock already depends on the new lock. > > > the existing dependency chain (in reverse order) is: > > -> #1 (&mapping->i_mmap_mutex){+.+...}: > [] lock_acquire+0xb2/0x1f0 > [] mutex_lock_nested+0x75/0x420 > [] unmap_mapping_range+0x6b/0x180 > [] truncate_pagecache_range+0x4a/0x60 > [] ext4_punch_hole+0x4d1/0x530 [ext4] > [] ext4_fallocate+0x156/0xb70 [ext4] > [] do_fallocate+0x119/0x1b0 > [] SyS_fallocate+0x43/0x70 > [] system_call_fastpath+0x16/0x1b > > -> #0 (jbd2_handle){+.+.+.}: > [] __lock_acquire+0x1d01/0x1eb0 > [] lock_acquire+0xb2/0x1f0 > [] start_this_handle+0x1ee/0x630 [jbd2] > [] jbd2__journal_start+0xd4/0x260 [jbd2] > [] __ext4_journal_start_sb+0x6d/0x190 [ext4] > [] _ext4_get_block+0x16a/0x1c0 [ext4] > [] ext4_get_block+0x16/0x20 [ext4] > [] do_dax_fault+0x5d9/0x640 > [] dax_fault+0x3f/0x90 > [] ext4_dax_fault+0x15/0x20 [ext4] > [] __do_fault+0x41/0xd0 > [] do_shared_fault.isra.56+0x35/0x220 > [] handle_mm_fault+0x303/0xf70 > [] __do_page_fault+0x1ec/0x5b0 > [] do_page_fault+0x22/0x30 > [] page_fault+0x28/0x30 > > other info that might help us debug this: > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(&mapping->i_mmap_mutex); > lock(jbd2_handle); > lock(&mapping->i_mmap_mutex); > lock(jbd2_handle); > > *** DEADLOCK *** > > 3 locks held by fstest/31836: > #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x182/0x5b0 > #1: (sb_pagefaults){++++..}, at: [] dax_fault+0x7a/0x90 > #2: (&mapping->i_mmap_mutex){+.+...}, at: [] do_dax_fault+0x4e0/0x640 > > stack backtrace: > CPU: 6 PID: 31836 Comm: fstest 
Tainted: G W 3.16.0-rc6+ #91 > Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013 > ffffffff825e63e0 ffff8800a0fc78c0 ffffffff815c6bc3 ffffffff825e63e0 > ffff8800a0fc7900 ffffffff815c4e59 ffff8800a0fc7970 ffff8800a88f4a50 > ffff8800a88f4af8 ffff8800a88f5280 0000000000000003 ffff8800a88f5248 > Call Trace: > [] dump_stack+0x4d/0x66 > [] print_circular_bug+0x201/0x20f > [] __lock_acquire+0x1d01/0x1eb0 > [] ? cyc2ns_read_end+0x20/0x20 > [] lock_acquire+0xb2/0x1f0 > [] ? start_this_handle+0x193/0x630 [jbd2] > [] start_this_handle+0x1ee/0x630 [jbd2] > [] ? start_this_handle+0x193/0x630 [jbd2] > [] ? new_handle+0x20/0x60 [jbd2] > [] jbd2__journal_start+0xd4/0x260 [jbd2] > [] ? _ext4_get_block+0x16a/0x1c0 [ext4] > [] __ext4_journal_start_sb+0x6d/0x190 [ext4] > [] _ext4_get_block+0x16a/0x1c0 [ext4] > [] ext4_get_block+0x16/0x20 [ext4] > [] do_dax_fault+0x5d9/0x640 > [] ? _ext4_get_block+0x1c0/0x1c0 [ext4] > [] ? _ext4_get_block+0x1c0/0x1c0 [ext4] > [] dax_fault+0x3f/0x90 > [] ext4_dax_fault+0x15/0x20 [ext4] > [] __do_fault+0x41/0xd0 > [] do_shared_fault.isra.56+0x35/0x220 > [] handle_mm_fault+0x303/0xf70 > [] ? __lock_is_held+0x56/0x80 > [] __do_page_fault+0x1ec/0x5b0 > [] ? vm_mmap_pgoff+0x9c/0xc0 > [] ? up_write+0x1f/0x40 > [] ? vm_mmap_pgoff+0x9c/0xc0 > [] ? trace_hardirqs_off_thunk+0x3a/0x3c > [] do_page_fault+0x22/0x30 > [] page_fault+0x28/0x30 > -- Jan Kara SUSE Labs, CR --EeQfGwPcQSOJBaQU Content-Type: text/x-patch; charset=us-ascii Content-Disposition: attachment; filename="0001-ext4-Avoid-lock-inversion-between-i_mmap_mutex-and-t.patch" >>From c01c905cf3c4c6304a5ea9836389d9cf0d575884 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 30 Jul 2014 11:49:07 +0200 Subject: [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and transaction start When DAX is enabled, it uses i_mmap_mutex as a protection against truncate during page fault. 
This inevitably forces i_mmap_mutex to rank outside of a transaction start and thus we have to avoid calling pagecache purging operations when transaction is started. Signed-off-by: Jan Kara --- fs/ext4/inode.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 8a064734e6eb..494a8645d63e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3631,13 +3631,19 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) if (IS_SYNC(inode)) ext4_handle_sync(handle); - /* Now release the pages again to reduce race window */ + inode->i_mtime = inode->i_ctime = ext4_current_time(inode); + ext4_mark_inode_dirty(handle, inode); + ext4_journal_stop(handle); + + /* + * Now release the pages again to reduce race window. This has to happen + * outside of a transaction to avoid lock inversion on i_mmap_mutex + * when DAX is enabled. + */ if (last_block_offset > first_block_offset) truncate_pagecache_range(inode, first_block_offset, last_block_offset); - - inode->i_mtime = inode->i_ctime = ext4_current_time(inode); - ext4_mark_inode_dirty(handle, inode); + goto out_dio; out_stop: ext4_journal_stop(handle); out_dio: -- 1.8.1.4 --EeQfGwPcQSOJBaQU-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
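The ordering rule behind the patch above can be sketched in user space. This is an illustration under stated assumptions, not kernel code: two pthread mutexes stand in for i_mmap_mutex and for a running jbd2 handle (which lockdep tracks as a lock-like map rather than a real mutex), and the names mmap_lock, journal_lock, fault_path, punch_hole_path and run_lock_order_demo are hypothetical. The fault path nests the "transaction" inside mmap_lock, as do_dax_fault() does in the report; the punch-hole path finishes its "transaction" before taking mmap_lock, which is roughly what moving ext4_journal_stop() ahead of truncate_pagecache_range() achieves.

```c
/* User-space sketch only; nothing here is a kernel API. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;    /* stand-in for i_mmap_mutex */
static pthread_mutex_t journal_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for jbd2_handle */
static long faults_done;   /* written only by the fault thread */
static long punches_done;  /* written only by the punch-hole thread */

/* Fault path: take mmap_lock first, then start the "transaction" inside it. */
static void *fault_path(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&mmap_lock);
        pthread_mutex_lock(&journal_lock);   /* get_block under mmap_lock */
        faults_done++;
        pthread_mutex_unlock(&journal_lock);
        pthread_mutex_unlock(&mmap_lock);
    }
    return NULL;
}

/*
 * Punch-hole path after the reordering: the "transaction" is stopped
 * before mmap_lock is taken, so journal_lock is never held while
 * waiting for mmap_lock and no ABBA cycle can form.
 */
static void *punch_hole_path(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&journal_lock);
        punches_done++;                      /* metadata update in the transaction */
        pthread_mutex_unlock(&journal_lock); /* "ext4_journal_stop" */

        pthread_mutex_lock(&mmap_lock);      /* pagecache purge would happen here */
        pthread_mutex_unlock(&mmap_lock);
    }
    return NULL;
}

/* Run both paths concurrently; returns the total number of completed passes. */
long run_lock_order_demo(long iterations)
{
    pthread_t fault_thread, punch_thread;
    int rc;

    faults_done = 0;
    punches_done = 0;

    rc = pthread_create(&fault_thread, NULL, fault_path, &iterations);
    assert(rc == 0);
    rc = pthread_create(&punch_thread, NULL, punch_hole_path, &iterations);
    assert(rc == 0);

    pthread_join(fault_thread, NULL);
    pthread_join(punch_thread, NULL);
    return faults_done + punches_done;
}
```

Calling run_lock_order_demo() from a small main() and linking with -pthread is enough to try it. If punch_hole_path instead held journal_lock across the mmap_lock acquisition, the two threads could deadlock in the way the lockdep report describes.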
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Wed, 30 Jul 2014 17:02:40 -0400 Message-ID: <20140730210239.GS6754@linux.intel.com> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> <20140730095229.GA19205@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20140730095229.GA19205@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Jul 30, 2014 at 11:52:29AM +0200, Jan Kara wrote: > I see the problem now. How about an attached patch? Do you see other > lockdep warnings with it? This patch fixes the problem, thanks! Regardless of DAX, I think this patch should be applied in order to avoid creating a dependency between i_mmap_mutex and jbd2_handle. I've now run into a different problem with COW pages ... more later. > >From c01c905cf3c4c6304a5ea9836389d9cf0d575884 Mon Sep 17 00:00:00 2001 > From: Jan Kara > Date: Wed, 30 Jul 2014 11:49:07 +0200 > Subject: [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and > transaction start > > When DAX is enabled, it uses i_mmap_mutex as a protection against > truncate during page fault. This inevitably forces i_mmap_mutex to rank > outside of a transaction start and thus we have to avoid calling > pagecache purging operations when transaction is started. 
> > Signed-off-by: Jan Kara > --- > fs/ext4/inode.c | 14 ++++++++++---- > 1 file changed, 10 insertions(+), 4 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 8a064734e6eb..494a8645d63e 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3631,13 +3631,19 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) > if (IS_SYNC(inode)) > ext4_handle_sync(handle); > > - /* Now release the pages again to reduce race window */ > + inode->i_mtime = inode->i_ctime = ext4_current_time(inode); > + ext4_mark_inode_dirty(handle, inode); > + ext4_journal_stop(handle); > + > + /* > + * Now release the pages again to reduce race window. This has to happen > + * outside of a transaction to avoid lock inversion on i_mmap_mutex > + * when DAX is enabled. > + */ > if (last_block_offset > first_block_offset) > truncate_pagecache_range(inode, first_block_offset, > last_block_offset); > - > - inode->i_mtime = inode->i_ctime = ext4_current_time(inode); > - ext4_mark_inode_dirty(handle, inode); > + goto out_dio; > out_stop: > ext4_journal_stop(handle); > out_dio: > -- > 1.8.1.4 > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Date: Sat, 9 Aug 2014 07:00:00 -0400
Message-ID: <20140809110000.GA32313@linux.intel.com>
References: <20140409102758.GM32103@quack.suse.cz>
 <20140409205111.GG5727@linux.intel.com>
 <20140409214331.GQ32103@quack.suse.cz>
 <20140729121259.GL6754@linux.intel.com>
 <20140729210457.GA17807@quack.suse.cz>
 <20140729212333.GO6754@linux.intel.com>
 <20140730095229.GA19205@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Jan Kara
Return-path:
Content-Disposition: inline
In-Reply-To: <20140730095229.GA19205@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, Jul 30, 2014 at 11:52:29AM +0200, Jan Kara wrote:
> I see the problem now. How about an attached patch? Do you see other
> lockdep warnings with it?

Hit another one :-( Same inversion between i_mmap_mutex and jbd2_handle:

-> #1 (&mapping->i_mmap_mutex){+.+...}:
       [] lock_acquire+0xb2/0x1f0
       [] mutex_lock_nested+0x75/0x420
       [] rmap_walk+0x6f/0x390
       [] page_mkclean+0x69/0x90
       [] clear_page_dirty_for_io+0x60/0x120
       [] mpage_submit_page+0x47/0x80 [ext4]
       [] mpage_process_page_bufs+0x110/0x120 [ext4]
       [] mpage_prepare_extent_to_map+0x1f0/0x2f0 [ext4]
       [] ext4_writepages+0x427/0x1060 [ext4]
       [] do_writepages+0x21/0x40
       [] __filemap_fdatawrite_range+0x59/0x60
       [] filemap_write_and_wait_range+0x2d/0x70
       [] ext4_sync_file+0x118/0x490 [ext4]
       [] vfs_fsync_range+0x1b/0x30
       [] SyS_msync+0x1ed/0x250

(ext4_writepages starts a transaction before calling
mpage_prepare_extent_to_map)
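The inversion reported above can be illustrated with a small userspace
model. This is only a sketch: the tracker, its function names and the two
"path" helpers are hypothetical, it records direct orderings only, and it
is far simpler than what lockdep actually does. It assumes, as the thread
describes, that the DAX fault path takes i_mmap_mutex before starting a
transaction while the writeback path starts a transaction before
rmap_walk() takes i_mmap_mutex.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the two in the report above. */
enum lock_class { I_MMAP_MUTEX, JBD2_HANDLE, NR_CLASSES };

struct order_tracker {
	bool held[NR_CLASSES];
	/* seen[a][b]: class b was once acquired while class a was held */
	bool seen[NR_CLASSES][NR_CLASSES];
};

static void tracker_init(struct order_tracker *t)
{
	memset(t, 0, sizeof(*t));
}

/*
 * Record the acquisition of 'c'. Returns false if some currently held
 * class h was previously acquired while c was held, i.e. both orders
 * c -> h and h -> c have now been observed.
 */
static bool tracker_acquire(struct order_tracker *t, enum lock_class c)
{
	bool ok = true;

	for (int h = 0; h < NR_CLASSES; h++) {
		if (!t->held[h] || h == (int)c)
			continue;
		if (t->seen[c][h])
			ok = false;
		t->seen[h][c] = true;
	}
	t->held[c] = true;
	return ok;
}

static void tracker_release(struct order_tracker *t, enum lock_class c)
{
	t->held[c] = false;
}

/* DAX fault path as described in the thread: mutex, then transaction. */
static bool model_dax_fault(struct order_tracker *t)
{
	bool ok = tracker_acquire(t, I_MMAP_MUTEX);

	ok = tracker_acquire(t, JBD2_HANDLE) && ok;
	tracker_release(t, JBD2_HANDLE);
	tracker_release(t, I_MMAP_MUTEX);
	return ok;
}

/* Writeback path from the trace: transaction, then rmap_walk's mutex. */
static bool model_writepages(struct order_tracker *t)
{
	bool ok = tracker_acquire(t, JBD2_HANDLE);

	ok = tracker_acquire(t, I_MMAP_MUTEX) && ok;
	tracker_release(t, I_MMAP_MUTEX);
	tracker_release(t, JBD2_HANDLE);
	return ok;
}
```

Running model_dax_fault() and then model_writepages() on the same tracker
makes the second call return false, which is the toy equivalent of the
warning above. Either path on its own is fine; it is the combination that
creates the cycle.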
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Date: Mon, 11 Aug 2014 10:51:47 +0200
Message-ID: <20140811085147.GB29526@quack.suse.cz>
References: <20140409102758.GM32103@quack.suse.cz>
 <20140409205111.GG5727@linux.intel.com>
 <20140409214331.GQ32103@quack.suse.cz>
 <20140729121259.GL6754@linux.intel.com>
 <20140729210457.GA17807@quack.suse.cz>
 <20140729212333.GO6754@linux.intel.com>
 <20140730095229.GA19205@quack.suse.cz>
 <20140809110000.GA32313@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Kara, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <20140809110000.GA32313@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Sat 09-08-14 07:00:00, Matthew Wilcox wrote:
> On Wed, Jul 30, 2014 at 11:52:29AM +0200, Jan Kara wrote:
> > I see the problem now. How about an attached patch? Do you see other
> > lockdep warnings with it?
>
> Hit another one :-( Same inversion between i_mmap_mutex and jbd2_handle:
>
> -> #1 (&mapping->i_mmap_mutex){+.+...}:
>        [] lock_acquire+0xb2/0x1f0
>        [] mutex_lock_nested+0x75/0x420
>        [] rmap_walk+0x6f/0x390
>        [] page_mkclean+0x69/0x90
>        [] clear_page_dirty_for_io+0x60/0x120
>        [] mpage_submit_page+0x47/0x80 [ext4]
>        [] mpage_process_page_bufs+0x110/0x120 [ext4]
>        [] mpage_prepare_extent_to_map+0x1f0/0x2f0 [ext4]
>        [] ext4_writepages+0x427/0x1060 [ext4]
>        [] do_writepages+0x21/0x40
>        [] __filemap_fdatawrite_range+0x59/0x60
>        [] filemap_write_and_wait_range+0x2d/0x70
>        [] ext4_sync_file+0x118/0x490 [ext4]
>        [] vfs_fsync_range+0x1b/0x30
>        [] SyS_msync+0x1ed/0x250
>
> (ext4_writepages starts a transaction before calling
> mpage_prepare_extent_to_map)
  Hum, yes, this is difficult.
Getting rid of clear_page_dirty_for_io() when the transaction is started
isn't easily possible :(. So I'm afraid we'll have to find some other way
to synchronize page faults and truncate / punch hole in DAX.

								Honza
-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Date: Mon, 11 Aug 2014 10:13:08 -0400
Message-ID: <20140811141308.GZ6754@linux.intel.com>
References: <20140409102758.GM32103@quack.suse.cz>
 <20140409205111.GG5727@linux.intel.com>
 <20140409214331.GQ32103@quack.suse.cz>
 <20140729121259.GL6754@linux.intel.com>
 <20140729210457.GA17807@quack.suse.cz>
 <20140729212333.GO6754@linux.intel.com>
 <20140730095229.GA19205@quack.suse.cz>
 <20140809110000.GA32313@linux.intel.com>
 <20140811085147.GB29526@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Jan Kara
Return-path:
Content-Disposition: inline
In-Reply-To: <20140811085147.GB29526@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Mon, Aug 11, 2014 at 10:51:47AM +0200, Jan Kara wrote:
> So I'm afraid we'll have to find some other way to synchronize
> page faults and truncate / punch hole in DAX.

What if we don't? If we hit the race (which is vanishingly unlikely with
real applications), the consequence is simply that after a truncate, a
file may be left with one or two blocks allocated somewhere after i_size.
As I understand it, that's not a real problem; they're temporarily
unavailable for allocation but will be freed on file removal or the next
truncation of that file.

I'm also still considering the possibility of having truncate-down block
until all mmaps that extend after the new i_size have been removed ...

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Date: Mon, 11 Aug 2014 16:35:00 +0200
Message-ID: <20140811143500.GF29526@quack.suse.cz>
References: <20140409102758.GM32103@quack.suse.cz>
 <20140409205111.GG5727@linux.intel.com>
 <20140409214331.GQ32103@quack.suse.cz>
 <20140729121259.GL6754@linux.intel.com>
 <20140729210457.GA17807@quack.suse.cz>
 <20140729212333.GO6754@linux.intel.com>
 <20140730095229.GA19205@quack.suse.cz>
 <20140809110000.GA32313@linux.intel.com>
 <20140811085147.GB29526@quack.suse.cz>
 <20140811141308.GZ6754@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Kara, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <20140811141308.GZ6754@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Mon 11-08-14 10:13:08, Matthew Wilcox wrote:
> On Mon, Aug 11, 2014 at 10:51:47AM +0200, Jan Kara wrote:
> > So I'm afraid we'll have to find some other way to synchronize
> > page faults and truncate / punch hole in DAX.
>
> What if we don't? If we hit the race (which is vanishingly unlikely with
> real applications), the consequence is simply that after a truncate, a
> file may be left with one or two blocks allocated somewhere after i_size.
> As I understand it, that's not a real problem; they're temporarily
> unavailable for allocation but will be freed on file removal or the next
> truncation of that file.
  You mean if you won't have any locking between page fault and truncate?
You can have:
a) extending truncate making forgotten blocks with non-zeros visible
b) filesystem corruption due to doubly used blocks (block will be freed
   from the truncated file and thus can be reallocated but it will still be
   accessible via mmap from the truncated file).

So not a good idea.

> I'm also still considering the possibility of having truncate-down block
> until all mmaps that extend after the new i_size have been removed ...
  Hum, I'm not sure how you would do that with current locking scheme and
wait for all page faults on that range to finish but maybe you have some
good idea :)

								Honza
-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Date: Mon, 11 Aug 2014 11:02:05 -0400
Message-ID: <20140811150205.GA6754@linux.intel.com>
References: <20140409205111.GG5727@linux.intel.com>
 <20140409214331.GQ32103@quack.suse.cz>
 <20140729121259.GL6754@linux.intel.com>
 <20140729210457.GA17807@quack.suse.cz>
 <20140729212333.GO6754@linux.intel.com>
 <20140730095229.GA19205@quack.suse.cz>
 <20140809110000.GA32313@linux.intel.com>
 <20140811085147.GB29526@quack.suse.cz>
 <20140811141308.GZ6754@linux.intel.com>
 <20140811143500.GF29526@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Jan Kara
Return-path:
Content-Disposition: inline
In-Reply-To: <20140811143500.GF29526@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Mon, Aug 11, 2014 at 04:35:00PM +0200, Jan Kara wrote:
> On Mon 11-08-14 10:13:08, Matthew Wilcox wrote:
> > On Mon, Aug 11, 2014 at 10:51:47AM +0200, Jan Kara wrote:
> > > So I'm afraid we'll have to find some other way to synchronize
> > > page faults and truncate / punch hole in DAX.
> >
> > What if we don't? If we hit the race (which is vanishingly unlikely with
> > real applications), the consequence is simply that after a truncate, a
> > file may be left with one or two blocks allocated somewhere after i_size.
> > As I understand it, that's not a real problem; they're temporarily
> > unavailable for allocation but will be freed on file removal or the next
> > truncation of that file.
>   You mean if you won't have any locking between page fault and truncate?
> You can have:
> a) extending truncate making forgotten blocks with non-zeros visible
> b) filesystem corruption due to doubly used blocks (block will be freed
>    from the truncated file and thus can be reallocated but it will still be
>    accessible via mmap from the truncated file).
>
> So not a good idea.

Not *no* locking ... just no locking around get_block, like in v7.
So check i_size, call get_block, lock i_mmap_mutex, re-check i_size,
insert mapping if i_size is OK, drop i_mmap_mutex. As long as get_block()
has enough locking of its own against set_size and concurrent calls
to get_block(), I don't think we can get visible non-zeroes or double
allocation.

> > I'm also still considering the possibility of having truncate-down block
> > until all mmaps that extend after the new i_size have been removed ...
>   Hum, I'm not sure how you would do that with current locking scheme and
> wait for all page faults on that range to finish but maybe you have some
> good idea :)

While it can be blocked with i_dio_count currently, this would be a
more complicated thing to do ...
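The fault sequence proposed in the message above (check i_size, call
get_block, lock i_mmap_mutex, re-check i_size, insert the mapping, unlock)
can be sketched in userspace as follows. Everything here is hypothetical
and single-threaded: struct toy_inode, the toy_ functions and the boolean
standing in for i_mmap_mutex are illustration only, and the "race" is
simulated by a get_block callback that performs a truncate before it
allocates.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 4096u
#define TOY_MAX_BLOCKS 16u

/* All names here are hypothetical; this is not kernel code. */
struct toy_inode {
	uint64_t i_size;                 /* file size in bytes */
	bool allocated[TOY_MAX_BLOCKS];  /* block is allocated on "disk" */
	bool mapped[TOY_MAX_BLOCKS];     /* block is mapped into "userspace" */
	bool i_mmap_locked;              /* stand-in for i_mmap_mutex */
};

/* Callback standing in for get_block(); returns 0 on success. */
typedef int (*toy_get_block_t)(struct toy_inode *inode, unsigned int block);

enum toy_fault_result { TOY_FAULT_MAPPED, TOY_FAULT_SIGBUS };

static enum toy_fault_result
toy_dax_fault(struct toy_inode *inode, unsigned int block,
	      toy_get_block_t get_block)
{
	uint64_t offset = (uint64_t)block * TOY_BLOCK_SIZE;
	enum toy_fault_result ret = TOY_FAULT_SIGBUS;

	/* 1. Unlocked i_size check. */
	if (block >= TOY_MAX_BLOCKS || offset >= inode->i_size)
		return TOY_FAULT_SIGBUS;

	/* 2. get_block() runs without the mutex and relies on its own locking. */
	if (get_block(inode, block) != 0)
		return TOY_FAULT_SIGBUS;

	/* 3. Take the mutex, re-check i_size, insert the mapping if still OK. */
	inode->i_mmap_locked = true;
	if (offset < inode->i_size) {
		inode->mapped[block] = true;
		ret = TOY_FAULT_MAPPED;
	}

	/* 4. Drop the mutex. */
	inode->i_mmap_locked = false;
	return ret;
}

/* Ordinary allocation with no concurrent truncate. */
static int toy_get_block(struct toy_inode *inode, unsigned int block)
{
	inode->allocated[block] = true;
	return 0;
}

/* Simulated race: a truncate to zero completes just before the allocation. */
static int toy_get_block_after_truncate(struct toy_inode *inode,
					unsigned int block)
{
	inode->i_size = 0;
	for (unsigned int i = 0; i < TOY_MAX_BLOCKS; i++)
		inode->allocated[i] = inode->mapped[i] = false;
	inode->allocated[block] = true;
	return 0;
}
```

With the racing callback the fault returns TOY_FAULT_SIGBUS and nothing is
mapped, but the block stays allocated beyond i_size. In this toy model
that is the only visible consequence, which matches the leftover-blocks
outcome described in the thread; it says nothing about whether real
get_block() implementations have the locking the message assumes.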
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Date: Mon, 11 Aug 2014 17:25:01 +0200
Message-ID: <20140811152501.GA12279@quack.suse.cz>
References: <20140409214331.GQ32103@quack.suse.cz>
 <20140729121259.GL6754@linux.intel.com>
 <20140729210457.GA17807@quack.suse.cz>
 <20140729212333.GO6754@linux.intel.com>
 <20140730095229.GA19205@quack.suse.cz>
 <20140809110000.GA32313@linux.intel.com>
 <20140811085147.GB29526@quack.suse.cz>
 <20140811141308.GZ6754@linux.intel.com>
 <20140811143500.GF29526@quack.suse.cz>
 <20140811150205.GA6754@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Kara, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
To: Matthew Wilcox
Return-path:
Content-Disposition: inline
In-Reply-To: <20140811150205.GA6754@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Mon 11-08-14 11:02:05, Matthew Wilcox wrote:
> On Mon, Aug 11, 2014 at 04:35:00PM +0200, Jan Kara wrote:
> > On Mon 11-08-14 10:13:08, Matthew Wilcox wrote:
> > > On Mon, Aug 11, 2014 at 10:51:47AM +0200, Jan Kara wrote:
> > > > So I'm afraid we'll have to find some other way to synchronize
> > > > page faults and truncate / punch hole in DAX.
> > >
> > > What if we don't? If we hit the race (which is vanishingly unlikely with
> > > real applications), the consequence is simply that after a truncate, a
> > > file may be left with one or two blocks allocated somewhere after i_size.
> > > As I understand it, that's not a real problem; they're temporarily
> > > unavailable for allocation but will be freed on file removal or the next
> > > truncation of that file.
> >   You mean if you won't have any locking between page fault and truncate?
> > You can have:
> > a) extending truncate making forgotten blocks with non-zeros visible
> > b) filesystem corruption due to doubly used blocks (block will be freed
> >    from the truncated file and thus can be reallocated but it will still be
> >    accessible via mmap from the truncated file).
> >
> > So not a good idea.
>
> Not *no* locking ... just no locking around get_block, like in v7.
> So check i_size, call get_block, lock i_mmap_mutex, re-check i_size,
> insert mapping if i_size is OK, drop i_mmap_mutex. As long as get_block()
> has enough locking of its own against set_size and concurrent calls
> to get_block(), I don't think we can get visible non-zeroes or double
> allocation.
  Ah, right. Now I remember. Yes, that solution will only occasionally
leave allocated blocks beyond EOF. That may be acceptable especially if we
mark the file with some flag and truncate those blocks after file is
closed in ext4_release_file().

								Honza
-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pd0-f177.google.com (mail-pd0-f177.google.com [209.85.192.177])
 by kanga.kvack.org (Postfix) with ESMTP id 0888C6B010E
 for ; Sun, 23 Mar 2014 15:09:23 -0400 (EDT)
Received: by mail-pd0-f177.google.com with SMTP id y10so4393749pdj.22
 for ; Sun, 23 Mar 2014 12:09:23 -0700 (PDT)
Received: from mga02.intel.com (mga02.intel.com.
[134.134.136.20]) by mx.google.com with ESMTP id se7si7625602pbb.139.2014.03.23.12.09.22 for ; Sun, 23 Mar 2014 12:09:22 -0700 (PDT) From: Matthew Wilcox Subject: [PATCH v7 10/22] Remove get_xip_mem Date: Sun, 23 Mar 2014 15:08:36 -0400 Message-Id: In-Reply-To: References: In-Reply-To: References: Sender: owner-linux-mm@kvack.org List-ID: To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com All callers of get_xip_mem() are now gone. Remove checks for it, initialisers of it, documentation of it and the only implementation of it. Add documentation for writing a filesystem that supports DAX. Signed-off-by: Matthew Wilcox Reviewed-by: Randy Dunlap --- Documentation/filesystems/Locking | 3 -- Documentation/filesystems/dax.txt | 82 +++++++++++++++++++++++++++++++++++++++ Documentation/filesystems/xip.txt | 71 --------------------------------- fs/exofs/inode.c | 1 - fs/ext2/inode.c | 1 - fs/ext2/xip.c | 37 ------------------ fs/ext2/xip.h | 3 -- fs/open.c | 5 +-- include/linux/fs.h | 2 - mm/fadvise.c | 6 ++- mm/madvise.c | 2 +- 11 files changed, 88 insertions(+), 125 deletions(-) create mode 100644 Documentation/filesystems/dax.txt delete mode 100644 Documentation/filesystems/xip.txt diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 5b0c083..2780d47 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -194,8 +194,6 @@ prototypes: void (*freepage)(struct page *); int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long nr_segs); - int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, - unsigned long *); int (*migratepage)(struct address_space *, struct page *, struct page *); int (*launder_page)(struct page *); int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long); @@ -220,7 +218,6 @@ invalidatepage: yes releasepage: yes freepage: yes direct_IO: 
-get_xip_mem:		maybe
 migratepage:		yes (both)
 launder_page:		yes
 is_partially_uptodate:	yes
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..06f84e5
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,82 @@
+Execute-in-place for file mappings
+----------------------------------
+
+Motivation
+----------
+
+File mappings are usually performed by mapping page cache pages to
+userspace. In addition, read & write file operations also transfer data
+between the page cache and storage.
+
+For memory backed storage devices that use the block device interface,
+the page cache pages are just copies of the original storage. The
+execute-in-place code removes the extra copy by performing reads and
+writes directly on the memory backed storage device. For file mappings,
+the storage device itself is mapped directly into userspace.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested. It may also return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call dax_fault() and
+  dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents. If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages. This problem is being worked on. That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here). Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b62eabf..0000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory. It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that it can provide, although it must not exceed the number of
-bytes requested. It may also return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c index ee4317fa..f9a5bf6 100644 --- a/fs/exofs/inode.c +++ b/fs/exofs/inode.c @@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = { .direct_IO = exofs_direct_IO, /* With these NULL has special meaning or default is not exported */ - .get_xip_mem = NULL, .migratepage = NULL, .launder_page = NULL, .is_partially_uptodate = NULL, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 252481f..b156fe8 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = { const struct address_space_operations ext2_aops_xip = { .bmap = ext2_bmap, - .get_xip_mem = ext2_get_xip_mem, .direct_IO = ext2_direct_IO, }; diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c index fa40091..ca745ff 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block, return ops->direct_access(bdev, sector, kaddr, pfn, size); } -static inline int -__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create, - sector_t *result) -{ - struct buffer_head tmp; - int rc; - - memset(&tmp, 0, sizeof(struct buffer_head)); - tmp.b_size = 1 << inode->i_blkbits; - rc = ext2_get_block(inode, pgoff, &tmp, create); - *result = tmp.b_blocknr; - - /* did we get a sparse block (hole in the file)? 
*/ - if (!tmp.b_blocknr && !rc) { - BUG_ON(create); - rc = -ENODATA; - } - - return rc; -} - int ext2_clear_xip_target(struct inode *inode, sector_t block) { @@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb) "not supported by bdev"); } } - -int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, - void **kmem, unsigned long *pfn) -{ - long rc; - sector_t block; - - /* first, retrieve the sector number */ - rc = __ext2_get_block(mapping->host, pgoff, create, &block); - if (rc) - return rc; - - /* retrieve address of the target data */ - rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE); - return (rc < 0) ? rc : 0; -} diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index 29be737..0fa8b7f 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb) struct ext2_sb_info *sbi = EXT2_SB(sb); return (sbi->s_mount_opt & EXT2_MOUNT_XIP); } -int ext2_get_xip_mem(struct address_space *, pgoff_t, int, - void **, unsigned long *); #else #define ext2_xip_verify_sb(sb) do { } while (0) #define ext2_use_xip(sb) 0 #define ext2_clear_xip_target(inode, chain) 0 -#define ext2_get_xip_mem NULL #endif diff --git a/fs/open.c b/fs/open.c index b9ed8b2..bc9f002 100644 --- a/fs/open.c +++ b/fs/open.c @@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f) { /* NB: we're sure to have correct a_ops only after f_op->open */ if (f->f_flags & O_DIRECT) { - if (!f->f_mapping->a_ops || - ((!f->f_mapping->a_ops->direct_IO) && - (!f->f_mapping->a_ops->get_xip_mem))) { + if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO) return -EINVAL; - } } return 0; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 9752ae5..c777056 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -375,8 +375,6 @@ struct address_space_operations { void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, loff_t offset, unsigned long 
nr_segs); - int (*get_xip_mem)(struct address_space *, pgoff_t, int, - void **, unsigned long *); /* * migrate the contents of a page to the specified target. If * migrate_mode is MIGRATE_ASYNC, it must not block. diff --git a/mm/fadvise.c b/mm/fadvise.c index 3bcfd81..1f1925f 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -28,6 +28,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) { struct fd f = fdget(fd); + struct inode *inode; struct address_space *mapping; struct backing_dev_info *bdi; loff_t endbyte; /* inclusive */ @@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) if (!f.file) return -EBADF; - if (S_ISFIFO(file_inode(f.file)->i_mode)) { + inode = file_inode(f.file); + if (S_ISFIFO(inode->i_mode)) { ret = -ESPIPE; goto out; } @@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) goto out; } - if (mapping->a_ops->get_xip_mem) { + if (IS_DAX(inode)) { switch (advice) { case POSIX_FADV_NORMAL: case POSIX_FADV_RANDOM: diff --git a/mm/madvise.c b/mm/madvise.c index 539eeb9..b6a2f52 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma, if (!file) return -EBADF; - if (file->f_mapping->a_ops->get_xip_mem) { + if (IS_DAX(file_inode(file))) { /* no bad return value, but ignore advice */ return 0; } -- 1.9.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
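The direct_access contract that the dax.txt text in this patch describes
(translate a 512-byte sector number to a kernel address and a pfn, return
how many bytes are available but never more than were requested, or a
negative errno) can be sketched in userspace. This is an illustration
under assumptions, not any driver's code: struct toy_ramdev,
toy_direct_access() and the choice of -EINVAL and -ERANGE are
hypothetical, and the device is modelled as a single contiguous buffer.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_SECTOR_SIZE 512L
#define TOY_PAGE_SHIFT  12

/* Hypothetical memory-backed device: one contiguous region. */
struct toy_ramdev {
	char *base;              /* address of the first byte */
	unsigned long base_pfn;  /* page frame number of the first page */
	long size;               /* device size in bytes */
};

/*
 * Translate a 512-byte sector number into an address and a pfn.
 * Returns the number of bytes usable from that point, clamped so that
 * it never exceeds 'size', or a negative errno.
 */
static long toy_direct_access(struct toy_ramdev *dev, unsigned long sector,
			      void **kaddr, unsigned long *pfn, long size)
{
	long offset, avail;

	if (size < 0)
		return -EINVAL;
	if (sector >= (unsigned long)(dev->size / TOY_SECTOR_SIZE))
		return -ERANGE;

	offset = (long)sector * TOY_SECTOR_SIZE;
	*kaddr = dev->base + offset;
	*pfn = dev->base_pfn + ((unsigned long)offset >> TOY_PAGE_SHIFT);

	/* Never report more than the caller asked for. */
	avail = dev->size - offset;
	return size < avail ? size : avail;
}
```

A caller that asks for more than is left on the device gets a short
count and is expected to loop; a caller that asks for less than is
available gets exactly what it asked for.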
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pd0-f180.google.com (mail-pd0-f180.google.com [209.85.192.180])
 by kanga.kvack.org (Postfix) with ESMTP id A69A56B00F0
 for ; Wed, 2 Apr 2014 15:24:49 -0400 (EDT)
Received: by mail-pd0-f180.google.com with SMTP id v10so616563pde.39
 for ; Wed, 02 Apr 2014 12:24:49 -0700 (PDT)
Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24])
 by mx.google.com with ESMTP id nm5si1789209pbc.466.2014.04.02.12.24.48
 for ; Wed, 02 Apr 2014 12:24:48 -0700 (PDT)
Date: Wed, 2 Apr 2014 15:24:46 -0400
From: Matthew Wilcox
Subject: Re: [PATCH v7 03/22] axonram: Fix bug in direct_access
Message-ID: <20140402192446.GC27299@linux.intel.com>
References: <20140329162216.GC1211@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140329162216.GC1211@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Sat, Mar 29, 2014 at 05:22:16PM +0100, Jan Kara wrote:
> On Sun 23-03-14 15:08:29, Matthew Wilcox wrote:
> > The 'pfn' returned by axonram was completely bogus, and has been since
> > 2008.
>   Maybe time to drop the driver instead? When no one noticed for 6 years, it
> seems pretty much dead... Or is there some possibility the driver can get
> reused for new HW?

It may be in use, just not with the -o xip option to ext2 ... I can't
find out which of the various vendors on the internet called 'Axon' this
device was originally supposed to support. I suspect it's dead, since
it's DDR-2, but *shrug*, it costs little to fix it.
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52])
 by kanga.kvack.org (Postfix) with ESMTP id E58246B00F2
 for ; Wed, 2 Apr 2014 15:28:02 -0400 (EDT)
Received: by mail-pa0-f52.google.com with SMTP id rd3so637464pab.39
 for ; Wed, 02 Apr 2014 12:28:02 -0700 (PDT)
Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93])
 by mx.google.com with ESMTP id ta1si1822034pab.31.2014.04.02.12.28.01
 for ; Wed, 02 Apr 2014 12:28:01 -0700 (PDT)
Date: Wed, 2 Apr 2014 15:27:59 -0400
From: Matthew Wilcox
Subject: Re: [PATCH v7 04/22] Change direct_access calling convention
Message-ID: <20140402192759.GD27299@linux.intel.com>
References: <214af2a38d840d0b8e983d39d03711d1292bc2d6.1395591795.git.matthew.r.wilcox@intel.com>
 <20140329163028.GD1211@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140329163028.GD1211@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Sat, Mar 29, 2014 at 05:30:28PM +0100, Jan Kara wrote:
> > @@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
> >  	*kaddr = page_address(page);
> >  	*pfn = page_to_pfn(page);
> >  
> > -	return 0;
> > +	/* Could optimistically check to see if the next page in the
> > +	 * file is mapped to the next page of physical RAM */
> > +	return PAGE_SIZE;
>   This should be min_t(long, PAGE_SIZE, size), shouldn't it?

Yes, it should. In practice, I don't think anyone's calling it with
size < PAGE_SIZE, but we might as well future-proof it.

> > @@ -866,25 +866,26 @@ fail:
> >  	bio_io_error(bio);
> >  }
> >  
> > -static int
> > +static long
> >  dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
> > -			void **kaddr, unsigned long *pfn)
> > +			void **kaddr, unsigned long *pfn, long size)
> >  {
> >  	struct dcssblk_dev_info *dev_info;
> > -	unsigned long pgoff;
> > +	unsigned long offset, dev_sz;
> > -	return 0;
> > +	return min_t(unsigned long, size, dev_sz - offset);
>                      ^^^ Why unsigned? Everything seems to be long...

offset is unsigned long ... but might as well do the comparison in signed
as unsigned. 'size' shouldn't be passed in as < 0 anyway.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-we0-f173.google.com (mail-we0-f173.google.com [74.125.82.173])
 by kanga.kvack.org (Postfix) with ESMTP id 1BC756B0031
 for ; Wed, 9 Apr 2014 06:15:19 -0400 (EDT)
Received: by mail-we0-f173.google.com with SMTP id w61so2215337wes.18
 for ; Wed, 09 Apr 2014 03:15:17 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id db6si2489015wib.25.2014.04.09.03.15.15 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 09 Apr 2014 03:15:16 -0700 (PDT) Date: Wed, 9 Apr 2014 12:15:12 +0200 From: Jan Kara Subject: Re: [PATCH v7 18/22] xip: Add xip_zero_page_range Message-ID: <20140409101512.GL32103@quack.suse.cz> References: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com, Ross Zwisler On Sun 23-03-14 15:08:44, Matthew Wilcox wrote: > This new function allows us to support hole-punch for XIP files by zeroing > a partial page, as opposed to the xip_truncate_page() function which can > only truncate to the end of the page. Reimplement xip_truncate_page() as > a macro that calls xip_zero_page_range(). > > Signed-off-by: Matthew Wilcox > [ported to 3.13-rc2] > Signed-off-by: Ross Zwisler Two comments below... ... > diff --git a/fs/dax.c b/fs/dax.c > index 45a0a41..2d6b4bc 100644 > --- a/fs/dax.c > +++ b/fs/dax.c ... > @@ -491,11 +494,16 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) > if (buffer_written(&bh)) { > void *addr; > err = dax_get_addr(inode, &bh, &addr); > - if (err) > + if (err < 0) > return err; > + /* > + * ext4 sometimes asks to zero past the end of a block. It > + * really just wants to zero to the end of the block. > + */ Then we should really fix ext4 I believe... 
> +	length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
>  	memset(addr + offset, 0, length);
>  	}
>
>  	return 0;
>  }
> -EXPORT_SYMBOL_GPL(dax_truncate_page);
> +EXPORT_SYMBOL_GPL(dax_zero_page_range);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bff394d..d0381ab 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2521,6 +2521,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
>
>  #ifdef CONFIG_FS_DAX
>  int dax_clear_blocks(struct inode *, sector_t block, long size);
> +int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
>  		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
> @@ -2532,7 +2533,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
>  	return 0;
>  }
>
> -static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
> +static inline int dax_zero_page_range(struct inode *inode, loff_t from,
> +		unsigned len, get_block_t gb)
>  {
>  	return 0;
>  }
> @@ -2545,6 +2547,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
>  }
>  #endif
>
> +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
> +#define dax_truncate_page(inode, from, get_block)	\
> +	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
                                         ^^^^
This should be (PAGE_CACHE_SIZE - (from & (PAGE_CACHE_SIZE - 1))),
shouldn't it?

> +
> +
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,

								Honza
-- 
Jan Kara
SUSE Labs, CR
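
The length expression suggested in the review above can be checked in isolation. This is a minimal userspace sketch under stated assumptions: SKETCH_PAGE_CACHE_SIZE is fixed at 4096 for illustration (the kernel's PAGE_CACHE_SIZE is architecture-dependent), and sketch_bytes_to_page_end() is an invented helper name, not part of the patch.

```c
#include <stdint.h>

/* Assumption: 4 KiB page cache pages, chosen only for illustration. */
#define SKETCH_PAGE_CACHE_SIZE 4096u

/*
 * Number of bytes from file offset 'from' to the end of the page that
 * contains it. Mirrors the expression from the review:
 *   PAGE_CACHE_SIZE - (from & (PAGE_CACHE_SIZE - 1))
 */
static unsigned sketch_bytes_to_page_end(int64_t from)
{
	return SKETCH_PAGE_CACHE_SIZE -
	       (unsigned)(from & (SKETCH_PAGE_CACHE_SIZE - 1));
}
```

For an offset of 5000 the helper yields 3192, so zeroing stops at the page boundary. Passing a fixed PAGE_CACHE_SIZE instead would ask for 4096 bytes starting mid-page, which runs past the end of that page. For a page-aligned offset the expression evaluates to a full page.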
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f180.google.com (mail-pd0-f180.google.com [209.85.192.180]) by kanga.kvack.org (Postfix) with ESMTP id B71606B0031 for ; Wed, 9 Apr 2014 16:49:06 -0400 (EDT) Received: by mail-pd0-f180.google.com with SMTP id v10so2902009pde.25 for ; Wed, 09 Apr 2014 13:49:05 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTP id jg5si977821pbb.254.2014.04.09.13.49.04 for ; Wed, 09 Apr 2014 13:49:04 -0700 (PDT) Date: Wed, 9 Apr 2014 16:48:06 -0400 From: Matthew Wilcox Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140409204806.GF5727@linux.intel.com> References: <20140408220525.GC26019@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140408220525.GC26019@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 12:05:25AM +0200, Jan Kara wrote: > > + if (!page) > > + return VM_FAULT_OOM; > > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > > + if (vmf->pgoff >= size) { > Maybe comment here that we have to recheck i_size so that we don't create > pages in the area truncate_pagecache() has already evicted. Done. > > + dax_get_addr(inode, bh, &vfrom); /* XXX: error handling */ > The error handling here is missing as the comment suggests :) Added. > > + if (buffer_unwritten(&bh) || buffer_new(&bh)) > > + dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); > Where is dax_clear_blocks() defined? Er ... patch 11. 
I'll reorder the patches ;-) > > + > > + error = dax_get_pfn(inode, &bh, &pfn); > > + if (error > 0) > > + error = vm_insert_mixed(vma, vaddr, pfn); > When there's a hole (thus page != NULL) and we are called from > dax_mkwrite(), this will always return EBUSY, correct? Erm ... it will return -EBUSY if this was the task that previously faulted on it. Drat. See below. > > + mutex_unlock(&mapping->i_mmap_mutex); > > + > > + if (page) { > > + delete_from_page_cache(page); > > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > > + PAGE_CACHE_SIZE, 0); > Here we unmap the PTE pointing to the hole page but then we'll have to > retry the fault again to fill in the pfn we've got? This seems wrong. I'd > say we want to remap the PTE from the hole page to a pfn we've got while > holding i_mmap_mutex. remap_pfn_range() almost does what you need, except > that you also need that to work for normal pages. So you might need to > create a new helper in mm layer for that. I think it's easier than that. How does this look? 

@@ -390,9 +389,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v
 		dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
 
 	error = dax_get_pfn(&bh, &pfn, blkbits);
-	if (error > 0)
-		error = vm_insert_mixed(vma, vaddr, pfn);
-	mutex_unlock(&mapping->i_mmap_mutex);
+	if (error <= 0)
+		goto unlock;
 
 	if (page) {
 		delete_from_page_cache(page);
@@ -402,6 +400,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v
 		page_cache_release(page);
 	}
 
+	error = vm_insert_mixed(vma, vaddr, pfn);
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM;
 	/* -EBUSY is fine, somebody else faulted on the same PTE */
@@ -409,6 +410,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v
 	BUG_ON(error);
 	return VM_FAULT_NOPAGE | major;
 
+ unlock:
+	mutex_unlock(&mapping->i_mmap_mutex);
  sigbus:
 	if (page) {
 		unlock_page(page);

> > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > +			get_block_t get_block)
> > +{
> > +	int result;
> > +	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +
> > +	sb_start_pagefault(sb);
> You don't need any filesystem freeze protection for the fault handler
> since that's not going to modify the filesystem.

Err ... we might allocate a block as a result of doing a write to a hole.
Or does that not count as 'modifying the filesystem' in this context?

> > +	file_update_time(vma->vm_file);
> Why do you update m/ctime? We are only reading the file...

... except that it might be a write fault. I think we modify the file
iff we return VM_FAULT_MAJOR from do_dax_fault(). So I'd be open to
something like this:

	sb_start_pagefault(sb);
	result = do_dax_fault(vma, vmf, get_block);
	if (result & VM_FAULT_MAJOR)
		file_update_time(vma->vm_file);
	sb_end_pagefault(sb);

Would that work better for you?

> > @@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = { > > #ifdef CONFIG_COMPAT > > .compat_ioctl = ext2_compat_ioctl, > > #endif > > - .mmap = generic_file_mmap, > > + .mmap = ext2_file_mmap, > So what's the point of ext2_file_operations ever handling IS_DAX() > inodes? Actually ext2_file_operations and ext2_xip_file_operations seem to > be the same after this patch so either you drop ext2_xip_file_operations > (I'm for this) or you can leave generic_file_mmap here and assume > ext2_file_mmap is always called for IS_DAX() inodes. The goal is to get them the same. At this point, the only sticky point is: .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, And splice is pretty damn sticky for DAX. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f172.google.com (mail-pd0-f172.google.com [209.85.192.172]) by kanga.kvack.org (Postfix) with ESMTP id 2E83C6B0038 for ; Thu, 10 Apr 2014 10:24:21 -0400 (EDT) Received: by mail-pd0-f172.google.com with SMTP id p10so3922030pdj.31 for ; Thu, 10 Apr 2014 07:24:15 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. 
[134.134.136.24]) by mx.google.com with ESMTP id ua2si2304284pab.241.2014.04.10.07.24.14 for ; Thu, 10 Apr 2014 07:24:15 -0700 (PDT)
Date: Thu, 10 Apr 2014 10:22:54 -0400
From: Matthew Wilcox
Subject: Re: [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb()
Message-ID: <20140410142254.GI5727@linux.intel.com>
References: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com> <20140409095254.GE32103@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140409095254.GE32103@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Wed, Apr 09, 2014 at 11:52:54AM +0200, Jan Kara wrote:
> > -	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
> > +	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
> >  		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
> >  			 "xip flag with busy inodes while remounting");
> > -		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
> > -		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
> > +		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
> Although this is correct, it was easier to see that the previous code is
> correct so I'd prefer if you kept it that way.

Depends how you think about it. I think of foo ^= bar as 'toggle the
bar bit in foo'. So I read the code as 'If the mount bit is incorrect,
print an error and toggle the bit'. I think you're reading the old code
as 'If the new mount bit differs from the old mount bit, make sure the
new mount bit is the same as the old mount bit'.
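
The two readings discussed in the message above produce the same result whenever the guard condition holds, which a small userspace sketch can confirm. Everything here is illustrative: SKETCH_MOUNT_XIP is a hypothetical bit value, not the real EXT2_MOUNT_XIP, and both function names are invented for the comparison.

```c
/* Hypothetical flag bit, for illustration only. */
#define SKETCH_MOUNT_XIP 0x0010u

/* Old form: clear the flag, then copy its previous value back in. */
static unsigned restore_by_mask(unsigned new_opt, unsigned old_opt)
{
	if ((new_opt ^ old_opt) & SKETCH_MOUNT_XIP) {
		new_opt &= ~SKETCH_MOUNT_XIP;
		new_opt |= old_opt & SKETCH_MOUNT_XIP;
	}
	return new_opt;
}

/*
 * New form: inside the guard the flag is known to differ from its old
 * value, so toggling it restores the old value.
 */
static unsigned restore_by_toggle(unsigned new_opt, unsigned old_opt)
{
	if ((new_opt ^ old_opt) & SKETCH_MOUNT_XIP)
		new_opt ^= SKETCH_MOUNT_XIP;
	return new_opt;
}
```

The toggle form is only safe because of the guard: outside it, `^=` would flip a flag that already matched.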
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f43.google.com (mail-pa0-f43.google.com [209.85.220.43]) by kanga.kvack.org (Postfix) with ESMTP id EEB6E6B0038 for ; Thu, 10 Apr 2014 10:26:31 -0400 (EDT)
Received: by mail-pa0-f43.google.com with SMTP id bj1so4053614pad.30 for ; Thu, 10 Apr 2014 07:26:31 -0700 (PDT)
Received: from mga03.intel.com (mga03.intel.com. [143.182.124.21]) by mx.google.com with ESMTP id tt4si2323513pac.21.2014.04.10.07.26.30 for ; Thu, 10 Apr 2014 07:26:31 -0700 (PDT)
Date: Thu, 10 Apr 2014 10:26:25 -0400
From: Matthew Wilcox
Subject: Re: [PATCH v7 17/22] Get rid of most mentions of XIP in ext2
Message-ID: <20140410142625.GK5727@linux.intel.com>
References: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> <20140409100435.GJ32103@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140409100435.GJ32103@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Wed, Apr 09, 2014 at 12:04:35PM +0200, Jan Kara wrote:
> On Sun 23-03-14 15:08:43, Matthew Wilcox wrote:
> > The only remaining usage is userspace's 'xip' option.
> Looks good. You can add:
> Reviewed-by: Jan Kara

I've been thinking about this patch, and I'm not happy with it any more :-)
I want to migrate people away from using 'xip' to 'dax' without breaking
anybody's scripts. So I'm thinking about adding a new 'dax' option and
having the 'xip' option print a warning and force-enable the 'dax' option.
That way, scripts that look for 'xip' in /proc/mounts won't break.
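
The migration idea in the message above (accept 'dax', keep accepting 'xip' with a warning) could look roughly like this. It is a hypothetical userspace sketch only: sketch_parse_option(), SKETCH_MOUNT_DAX, and the warning text are all invented here and are not taken from ext2 or from any posted patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bit, for illustration only. */
#define SKETCH_MOUNT_DAX 0x1u

/*
 * Hypothetical option parser: "dax" enables DAX; the legacy "xip" spelling
 * prints a warning and enables DAX as well.
 * Returns 0 on success, -1 for an unrecognised option.
 */
static int sketch_parse_option(const char *opt, unsigned *flags)
{
	if (strcmp(opt, "dax") == 0) {
		*flags |= SKETCH_MOUNT_DAX;
		return 0;
	}
	if (strcmp(opt, "xip") == 0) {
		fprintf(stderr, "warning: 'xip' option used; enabling 'dax'\n");
		*flags |= SKETCH_MOUNT_DAX;
		return 0;
	}
	return -1;
}
```

The sketch covers only option parsing. How the enabled option is reported back in /proc/mounts is a separate decision that the message leaves open.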
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f49.google.com (mail-pa0-f49.google.com [209.85.220.49]) by kanga.kvack.org (Postfix) with ESMTP id 2BD3D6B0031 for ; Tue, 17 Jun 2014 14:24:09 -0400 (EDT) Received: by mail-pa0-f49.google.com with SMTP id lj1so5914988pab.8 for ; Tue, 17 Jun 2014 11:24:08 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id qd5si15293692pbb.211.2014.06.17.11.24.07 for ; Tue, 17 Jun 2014 11:24:08 -0700 (PDT) Date: Tue, 17 Jun 2014 14:19:25 -0400 From: Matthew Wilcox Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs Message-ID: <20140617181925.GF12025@linux.intel.com> References: <53A084E3.6080103@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <53A084E3.6080103@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Boaz Harrosh Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, Jun 17, 2014 at 09:11:47PM +0300, Boaz Harrosh wrote: > Looking at the brd code I fail to see how it will ever support NV_DIMMS. > brd is "struct page" based and shares RAM from the same memory pool as the rest > of the system. But NV_DIMMS is not page-based and is excluded from the > memory system. It needs to be exclusively owned by a device and the mounted > FS. > > We currently have in our lab the old DDR3 based NV_DIMMS and on regular boot > it appears as RAM. We need to use memmap= option on command line of Kernel > to exclude it from use by Kernel. > > We have received our DDR4 based NV_DIMMS but still waiting for the actual > system board to support it. As I understand from STD documentation > these devices will not identify as RAM and will be exported as ACPI or > SBUS devices that can be queried for sizes and address as well as properties > about the chips. 
So I imagine a udev rule will need to probe the right driver > to mount over those. > > So currently from what I can see only the infamous PMFS is the setup that > can actually mount/support my NV_DIMMS today. > > It seems to me like we need a *new* block device that receives, like PMFS, > an physical_address + size on load and will export this raw region as a block > device. Of course with support of new DAX API. Should I send in such a device > code. > > (I've seen the linux-nvdimm project on github but did not see how my above > problem is addressed, it looks geared for that other type DDR bus devices) > > So please how is all that suppose to work, what is the strategy stack > for all this? I guess for now I'm stuck with PMFS. > > (BTW: A public git tree of DAX patches ;-) ) https://github.com/01org/prd should sort you out with both a git tree and a new block driver. You'll need to tell it manually what address range to use. I'm using it against regular DIMMs, and this works pretty well for me since my BIOS doesn't zero DRAM on reset. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f180.google.com (mail-pd0-f180.google.com [209.85.192.180]) by kanga.kvack.org (Postfix) with ESMTP id 42FA86B0036 for ; Tue, 29 Jul 2014 17:23:39 -0400 (EDT) Received: by mail-pd0-f180.google.com with SMTP id y13so265618pdi.25 for ; Tue, 29 Jul 2014 14:23:38 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. 
[134.134.136.20]) by mx.google.com with ESMTP id eh7si108740pac.236.2014.07.29.14.23.38 for ; Tue, 29 Jul 2014 14:23:38 -0700 (PDT) Date: Tue, 29 Jul 2014 17:23:33 -0400 From: Matthew Wilcox Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140729212333.GO6754@linux.intel.com> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140729210457.GA17807@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, Jul 29, 2014 at 11:04:57PM +0200, Jan Kara wrote: > > Path 1: > > > > ext4_fallocate -> > > ext4_punch_hole -> > > ext4_inode_attach_jinode() -> ... -> > > lock_map_acquire(&handle->h_lockdep_map); > > truncate_pagecache_range() -> > > unmap_mapping_range() -> > > mutex_lock(&mapping->i_mmap_mutex); > This is strange. I don't see how ext4_inode_attach_jinode() can ever lead > to lock_map_acquire(&handle->h_lockdep_map). Can you post a full trace for > this? Unfortunately, lockdep finds the inversion in the other order, so I have the backtraces of this path hitting the i_mmap_mutex while already holding jbd_mutex: ====================================================== [ INFO: possible circular locking dependency detected ] 3.16.0-rc6+ #91 Tainted: G W ------------------------------------------------------- fstest/31836 is trying to acquire lock: (jbd2_handle){+.+.+.}, at: [] start_this_handle+0x193/0x630 [jbd2] but task is already holding lock: (&mapping->i_mmap_mutex){+.+...}, at: [] do_dax_fault+0x4e0/0x640 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #1 (&mapping->i_mmap_mutex){+.+...}: [] lock_acquire+0xb2/0x1f0 [] mutex_lock_nested+0x75/0x420 [] unmap_mapping_range+0x6b/0x180 [] truncate_pagecache_range+0x4a/0x60 [] ext4_punch_hole+0x4d1/0x530 [ext4] [] ext4_fallocate+0x156/0xb70 [ext4] [] do_fallocate+0x119/0x1b0 [] SyS_fallocate+0x43/0x70 [] system_call_fastpath+0x16/0x1b -> #0 (jbd2_handle){+.+.+.}: [] __lock_acquire+0x1d01/0x1eb0 [] lock_acquire+0xb2/0x1f0 [] start_this_handle+0x1ee/0x630 [jbd2] [] jbd2__journal_start+0xd4/0x260 [jbd2] [] __ext4_journal_start_sb+0x6d/0x190 [ext4] [] _ext4_get_block+0x16a/0x1c0 [ext4] [] ext4_get_block+0x16/0x20 [ext4] [] do_dax_fault+0x5d9/0x640 [] dax_fault+0x3f/0x90 [] ext4_dax_fault+0x15/0x20 [ext4] [] __do_fault+0x41/0xd0 [] do_shared_fault.isra.56+0x35/0x220 [] handle_mm_fault+0x303/0xf70 [] __do_page_fault+0x1ec/0x5b0 [] do_page_fault+0x22/0x30 [] page_fault+0x28/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&mapping->i_mmap_mutex); lock(jbd2_handle); lock(&mapping->i_mmap_mutex); lock(jbd2_handle); *** DEADLOCK *** 3 locks held by fstest/31836: #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x182/0x5b0 #1: (sb_pagefaults){++++..}, at: [] dax_fault+0x7a/0x90 #2: (&mapping->i_mmap_mutex){+.+...}, at: [] do_dax_fault+0x4e0/0x640 stack backtrace: CPU: 6 PID: 31836 Comm: fstest Tainted: G W 3.16.0-rc6+ #91 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013 ffffffff825e63e0 ffff8800a0fc78c0 ffffffff815c6bc3 ffffffff825e63e0 ffff8800a0fc7900 ffffffff815c4e59 ffff8800a0fc7970 ffff8800a88f4a50 ffff8800a88f4af8 ffff8800a88f5280 0000000000000003 ffff8800a88f5248 Call Trace: [] dump_stack+0x4d/0x66 [] print_circular_bug+0x201/0x20f [] __lock_acquire+0x1d01/0x1eb0 [] ? cyc2ns_read_end+0x20/0x20 [] lock_acquire+0xb2/0x1f0 [] ? start_this_handle+0x193/0x630 [jbd2] [] start_this_handle+0x1ee/0x630 [jbd2] [] ? 
start_this_handle+0x193/0x630 [jbd2] [] ? new_handle+0x20/0x60 [jbd2] [] jbd2__journal_start+0xd4/0x260 [jbd2] [] ? _ext4_get_block+0x16a/0x1c0 [ext4] [] __ext4_journal_start_sb+0x6d/0x190 [ext4] [] _ext4_get_block+0x16a/0x1c0 [ext4] [] ext4_get_block+0x16/0x20 [ext4] [] do_dax_fault+0x5d9/0x640 [] ? _ext4_get_block+0x1c0/0x1c0 [ext4] [] ? _ext4_get_block+0x1c0/0x1c0 [ext4] [] dax_fault+0x3f/0x90 [] ext4_dax_fault+0x15/0x20 [ext4] [] __do_fault+0x41/0xd0 [] do_shared_fault.isra.56+0x35/0x220 [] handle_mm_fault+0x303/0xf70 [] ? __lock_is_held+0x56/0x80 [] __do_page_fault+0x1ec/0x5b0 [] ? vm_mmap_pgoff+0x9c/0xc0 [] ? up_write+0x1f/0x40 [] ? vm_mmap_pgoff+0x9c/0xc0 [] ? trace_hardirqs_off_thunk+0x3a/0x3c [] do_page_fault+0x22/0x30 [] page_fault+0x28/0x30 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id 716FA6B0038 for ; Wed, 30 Jul 2014 05:52:35 -0400 (EDT) Received: by mail-wi0-f180.google.com with SMTP id n3so1972943wiv.1 for ; Wed, 30 Jul 2014 02:52:33 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id e5si25394805wib.78.2014.07.30.02.52.31 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 30 Jul 2014 02:52:32 -0700 (PDT) Date: Wed, 30 Jul 2014 11:52:29 +0200 From: Jan Kara Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140730095229.GA19205@quack.suse.cz> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="EeQfGwPcQSOJBaQU" Content-Disposition: inline In-Reply-To: <20140729212333.GO6754@linux.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org --EeQfGwPcQSOJBaQU Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Tue 29-07-14 17:23:33, Matthew Wilcox wrote: > On Tue, Jul 29, 2014 at 11:04:57PM +0200, Jan Kara wrote: > > > Path 1: > > > > > > ext4_fallocate -> > > > ext4_punch_hole -> > > > ext4_inode_attach_jinode() -> ... -> > > > lock_map_acquire(&handle->h_lockdep_map); > > > truncate_pagecache_range() -> > > > unmap_mapping_range() -> > > > mutex_lock(&mapping->i_mmap_mutex); > > This is strange. I don't see how ext4_inode_attach_jinode() can ever lead > > to lock_map_acquire(&handle->h_lockdep_map). Can you post a full trace for > > this? > > Unfortunately, lockdep finds the inversion in the other order, so I > have the backtraces of this path hitting the i_mmap_mutex while already > holding jbd_mutex: I see the problem now. How about an attached patch? Do you see other lockdep warnings with it? 
								Honza
-- 
Jan Kara
SUSE Labs, CR

--EeQfGwPcQSOJBaQU
Content-Type: text/x-patch; charset=us-ascii
Content-Disposition: attachment; filename="0001-ext4-Avoid-lock-inversion-between-i_mmap_mutex-and-t.patch"

--EeQfGwPcQSOJBaQU--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751544AbaCWTJD (ORCPT ); Sun, 23 Mar 2014 15:09:03 -0400
Received: from mga09.intel.com ([134.134.136.24]:29846 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751268AbaCWTI6 (ORCPT ); Sun, 23 Mar 2014 15:08:58 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="497904553"
From: Matthew Wilcox
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Matthew Wilcox , willy@linux.intel.com
Subject: [PATCH v7 00/22] Support ext4 on NV-DIMMs
Date: Sun, 23 Mar 2014 15:08:26 -0400
Message-Id:
X-Mailer: git-send-email 1.9.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM. While that works,
it currently wastes memory and CPU time buffering the files in the page
cache. We have support in ext2 for bypassing the page cache, but it
has some races which are unfixable in the current design. This series
of patches rewrite the underlying support, and add support for direct
access to ext4.

This iteration of the patchset rebases to Linus' 3.14-rc7 (plus Kirill's patches in linux-next http://marc.info/?l=linux-mm&m=139206489208546&w=2) and fixes several bugs: - Initialise cow_page in do_page_mkwrite() (Matthew Wilcox) - Clear new or unwritten blocks in page fault handler (Matthew Wilcox) - Only call get_block when necessary (Matthew Wilcox) - Reword Kconfig options (Matthew Wilcox / Vishal Verma) - Fix a race between page fault and truncate (Matthew Wilcox) - Fix a race between fault-for-read and fault-for-write (Matthew Wilcox) - Zero the correct bytes in dax_new_buf() (Toshi Kani) - Add DIO_LOCKING to an invocation of dax_do_io in ext4 (Ross Zwisler) Relative to the last patchset, I folded the 'Add reporting of major faults' patch into the patch that adds the DAX page fault handler. The v6 patchset had seven additional xfstests failures. This patchset now passes approximately as many xfstests as ext4 does on a ramdisk. Matthew Wilcox (21): Fix XIP fault vs truncate race Allow page fault handlers to perform the COW axonram: Fix bug in direct_access Change direct_access calling convention Introduce IS_DAX(inode) Replace XIP read and write with DAX I/O Replace the XIP page fault handler with the DAX page fault handler Replace xip_truncate_page with dax_truncate_page Remove mm/filemap_xip.c Remove get_xip_mem Replace ext2_clear_xip_target with dax_clear_blocks ext2: Remove ext2_xip_verify_sb() ext2: Remove ext2_use_xip ext2: Remove xip.c and xip.h Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX ext2: Remove ext2_aops_xip Get rid of most mentions of XIP in ext2 xip: Add xip_zero_page_range ext4: Make ext4_block_zero_page_range static ext4: Fix typos brd: Rename XIP to DAX Ross Zwisler (1): ext4: Add DAX functionality Documentation/filesystems/Locking | 3 - Documentation/filesystems/dax.txt | 84 ++++++ Documentation/filesystems/ext4.txt | 2 + Documentation/filesystems/xip.txt | 68 ----- arch/powerpc/sysdev/axonram.c | 8 +- 
drivers/block/Kconfig | 13 +- drivers/block/brd.c | 22 +- drivers/s390/block/dcssblk.c | 19 +- fs/Kconfig | 21 +- fs/Makefile | 1 + fs/dax.c | 509 +++++++++++++++++++++++++++++++++++++ fs/exofs/inode.c | 1 - fs/ext2/Kconfig | 11 - fs/ext2/Makefile | 1 - fs/ext2/ext2.h | 9 +- fs/ext2/file.c | 45 +++- fs/ext2/inode.c | 37 +-- fs/ext2/namei.c | 13 +- fs/ext2/super.c | 48 ++-- fs/ext2/xip.c | 91 ------- fs/ext2/xip.h | 26 -- fs/ext4/ext4.h | 8 +- fs/ext4/file.c | 53 +++- fs/ext4/indirect.c | 19 +- fs/ext4/inode.c | 94 ++++--- fs/ext4/namei.c | 10 +- fs/ext4/super.c | 39 ++- fs/open.c | 5 +- include/linux/blkdev.h | 4 +- include/linux/fs.h | 49 +++- include/linux/mm.h | 2 + mm/Makefile | 1 - mm/fadvise.c | 6 +- mm/filemap.c | 6 +- mm/filemap_xip.c | 483 ----------------------------------- mm/madvise.c | 2 +- mm/memory.c | 45 +++- 37 files changed, 984 insertions(+), 874 deletions(-) create mode 100644 Documentation/filesystems/dax.txt delete mode 100644 Documentation/filesystems/xip.txt create mode 100644 fs/dax.c delete mode 100644 fs/ext2/xip.c delete mode 100644 fs/ext2/xip.h delete mode 100644 mm/filemap_xip.c -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751732AbaCWTJL (ORCPT ); Sun, 23 Mar 2014 15:09:11 -0400 Received: from mga01.intel.com ([192.55.52.88]:9372 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751411AbaCWTI7 (ORCPT ); Sun, 23 Mar 2014 15:08:59 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505021411" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 02/22] Allow page fault handlers to perform the COW Date: Sun, 23 Mar 2014 15:08:28 -0400 Message-Id: X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org 
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently COW of an XIP file is done by first bringing in a read-only mapping, then retrying the fault and copying the page. It is much more efficient to tell the fault handler that a COW is being attempted (by passing in the pre-allocated page in the vm_fault structure), and allow the handler to perform the COW operation itself. Where the filemap code protects against truncation of the file until the PTE has been installed with the page lock, the XIP code use the i_mmap_mutex instead. We must therefore unlock the i_mmap_mutex after inserting the PTE. Signed-off-by: Matthew Wilcox --- include/linux/mm.h | 2 ++ mm/memory.c | 45 +++++++++++++++++++++++++++++++++------------ 2 files changed, 35 insertions(+), 12 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index c1b7414..513b78a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -205,6 +205,7 @@ struct vm_fault { pgoff_t pgoff; /* Logical page offset based on vma */ void __user *virtual_address; /* Faulting virtual address */ + struct page *cow_page; /* Handler may choose to COW */ struct page *page; /* ->fault handlers should return a * page here, unless VM_FAULT_NOPAGE * is set (which is also implied by @@ -1010,6 +1011,7 @@ static inline int page_mapped(struct page *page) #define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned small page */ #define VM_FAULT_HWPOISON_LARGE 0x0020 /* Hit poisoned large page. 
Index encoded in upper bits */ +#define VM_FAULT_COWED 0x0080 /* ->fault COWed the page instead */ #define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */ #define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */ #define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */ diff --git a/mm/memory.c b/mm/memory.c index 07b4287..2a2ecac 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2602,6 +2602,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, vmf.pgoff = page->index; vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; vmf.page = page; + vmf.cow_page = NULL; ret = vma->vm_ops->page_mkwrite(vma, &vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) @@ -3288,7 +3289,8 @@ oom: } static int __do_fault(struct vm_area_struct *vma, unsigned long address, - pgoff_t pgoff, unsigned int flags, struct page **page) + pgoff_t pgoff, unsigned int flags, + struct page *cow_page, struct page **page) { struct vm_fault vmf; int ret; @@ -3297,10 +3299,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, vmf.pgoff = pgoff; vmf.flags = flags; vmf.page = NULL; + vmf.cow_page = cow_page; ret = vma->vm_ops->fault(vma, &vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) return ret; + if (unlikely(ret & VM_FAULT_COWED)) + goto out; if (unlikely(PageHWPoison(vmf.page))) { if (ret & VM_FAULT_LOCKED) @@ -3314,6 +3319,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, else VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page); + out: *page = vmf.page; return ret; } @@ -3351,7 +3357,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *pte; int ret; - ret = __do_fault(vma, address, pgoff, flags, &fault_page); + ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) return ret; @@ -3368,6 +3374,12 @@ static int do_read_fault(struct 
mm_struct *mm, struct vm_area_struct *vma, return ret; } +/* + * If the fault handler performs the COW, it does not return a page, + * so cannot use the page's lock to protect against a concurrent truncate + * operation. Instead it returns with the i_mmap_mutex held, which must + * be released after the PTE has been inserted. + */ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pgoff_t pgoff, unsigned int flags, pte_t orig_pte) @@ -3389,25 +3401,34 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, return VM_FAULT_OOM; } - ret = __do_fault(vma, address, pgoff, flags, &fault_page); + ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) goto uncharge_out; - copy_user_highpage(new_page, fault_page, address, vma); + if (!(ret & VM_FAULT_COWED)) + copy_user_highpage(new_page, fault_page, address, vma); __SetPageUptodate(new_page); pte = pte_offset_map_lock(mm, pmd, address, &ptl); - if (unlikely(!pte_same(*pte, orig_pte))) { - pte_unmap_unlock(pte, ptl); + if (unlikely(!pte_same(*pte, orig_pte))) + goto unlock_out; + do_set_pte(vma, address, new_page, pte, true, true); + pte_unmap_unlock(pte, ptl); + if (ret & VM_FAULT_COWED) { + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); + } else { unlock_page(fault_page); page_cache_release(fault_page); - goto uncharge_out; } - do_set_pte(vma, address, new_page, pte, true, true); - pte_unmap_unlock(pte, ptl); - unlock_page(fault_page); - page_cache_release(fault_page); return ret; +unlock_out: + pte_unmap_unlock(pte, ptl); + if (ret & VM_FAULT_COWED) { + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); + } else { + unlock_page(fault_page); + page_cache_release(fault_page); + } uncharge_out: mem_cgroup_uncharge_page(new_page); page_cache_release(new_page); @@ -3424,7 +3445,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct 
*vma, int dirtied = 0; int ret, tmp; - ret = __do_fault(vma, address, pgoff, flags, &fault_page); + ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) return ret; -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752011AbaCWTKo (ORCPT ); Sun, 23 Mar 2014 15:10:44 -0400 Received: from mga09.intel.com ([134.134.136.24]:29846 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751328AbaCWTJH (ORCPT ); Sun, 23 Mar 2014 15:09:07 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505944121" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , Matthew Wilcox Subject: [PATCH v7 22/22] brd: Rename XIP to DAX Date: Sun, 23 Mar 2014 15:08:48 -0400 Message-Id: <7fd74703525f4077ed7c2b273ce6d082b03f0b61.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Matthew Wilcox Since this is relating to FS_XIP, not KERNEL_XIP, it should be called DAX instead of XIP. Signed-off-by: Matthew Wilcox --- drivers/block/Kconfig | 13 +++++++------ drivers/block/brd.c | 14 +++++++------- fs/Kconfig | 4 ++-- 3 files changed, 16 insertions(+), 15 deletions(-) diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 014a1cf..1b8094d 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE The default value is 4096 kilobytes. Only change this if you know what you are doing. 
-config BLK_DEV_XIP - bool "Support XIP filesystems on RAM block device" - depends on BLK_DEV_RAM +config BLK_DEV_RAM_DAX + bool "Support Direct Access (DAX) to RAM block devices" + depends on BLK_DEV_RAM && FS_DAX default n help - Support XIP filesystems (such as ext2 with XIP support on) on - top of block ram device. This will slightly enlarge the kernel, and - will prevent RAM block device backing store memory from being + Support filesystems using DAX to access RAM block devices. This + avoids double-buffering data in the page cache before copying it + to the block device. Answering Y will slightly enlarge the kernel, + and will prevent RAM block device backing store memory from being allocated from highmem (only a problem for highmem systems). config CDROM_PKTCDVD diff --git a/drivers/block/brd.c b/drivers/block/brd.c index 00da60d..619e0e0 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) * Must use NOIO because we don't want to recurse back into the * block or filesystem layers from page reclaim. * - * Cannot support XIP and highmem, because our ->direct_access - * routine for XIP must return memory that is always addressable. - * If XIP was reworked to use pfns and kmap throughout, this + * Cannot support DAX and highmem, because our ->direct_access + * routine for DAX must return memory that is always addressable. + * If DAX was reworked to use pfns and kmap throughout, this * restriction might be able to be lifted. 
*/ gfp_flags = GFP_NOIO | __GFP_ZERO; -#ifndef CONFIG_BLK_DEV_XIP +#ifndef CONFIG_BLK_DEV_RAM_DAX gfp_flags |= __GFP_HIGHMEM; #endif page = alloc_page(gfp_flags); @@ -360,7 +360,7 @@ out: bio_endio(bio, err); } -#ifdef CONFIG_BLK_DEV_XIP +#ifdef CONFIG_BLK_DEV_RAM_DAX static long brd_direct_access(struct block_device *bdev, sector_t sector, void **kaddr, unsigned long *pfn, long size) { @@ -383,6 +383,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector, * file is mapped to the next page of physical RAM */ return PAGE_SIZE; } +#else +#define brd_direct_access NULL #endif static int brd_ioctl(struct block_device *bdev, fmode_t mode, @@ -422,9 +424,7 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode, static const struct block_device_operations brd_fops = { .owner = THIS_MODULE, .ioctl = brd_ioctl, -#ifdef CONFIG_BLK_DEV_XIP .direct_access = brd_direct_access, -#endif }; /* diff --git a/fs/Kconfig b/fs/Kconfig index 620ab73..376bd0a 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig" source "fs/nilfs2/Kconfig" config FS_DAX - bool "Direct Access support" + bool "Direct Access (DAX) support" depends on MMU help Direct Access (DAX) can be used on memory-backed block devices. @@ -45,7 +45,7 @@ config FS_DAX If you do not have a block device that is capable of using this, or if unsure, say N. Saying Y will increase the size of the kernel - by about 2kB. + by about 5kB. 
endif # BLOCK -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752089AbaCWTKs (ORCPT ); Sun, 23 Mar 2014 15:10:48 -0400 Received: from mga09.intel.com ([134.134.136.24]:29849 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751625AbaCWTJG (ORCPT ); Sun, 23 Mar 2014 15:09:06 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="478413332" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 17/22] Get rid of most mentions of XIP in ext2 Date: Sun, 23 Mar 2014 15:08:43 -0400 Message-Id: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The only remaining usage is userspace's 'xip' option. 
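For readers following along, here is a minimal userspace sketch (not kernel code) of what that sentence means in practice: the internal token and flag are renamed to DAX, while the string userspace passes at mount time is still "xip". All of the sketch_* names are hypothetical stand-ins invented for illustration. Only the 0x010000 flag value and the "xip" string come from the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the ext2 option tokens; not the kernel's enum. */
enum sketch_opt { SKETCH_OPT_DAX, SKETCH_OPT_NOBH, SKETCH_OPT_ERR };

struct sketch_token {
	enum sketch_opt token;
	const char *pattern;
};

/* The internal name says DAX; the user-visible string remains "xip". */
static const struct sketch_token sketch_tokens[] = {
	{ SKETCH_OPT_DAX,  "xip"  },
	{ SKETCH_OPT_NOBH, "nobh" },
};

/* Same bit value the diff assigns to EXT2_MOUNT_DAX. */
#define SKETCH_MOUNT_DAX 0x010000UL

/* Map one option string to its token, or SKETCH_OPT_ERR if unknown. */
static enum sketch_opt sketch_match(const char *opt)
{
	size_t i;

	assert(opt != NULL);
	for (i = 0; i < sizeof(sketch_tokens) / sizeof(sketch_tokens[0]); i++)
		if (strcmp(opt, sketch_tokens[i].pattern) == 0)
			return sketch_tokens[i].token;
	return SKETCH_OPT_ERR;
}

/* Return the mount-option bitmask after applying a single option string. */
static unsigned long sketch_parse(const char *opt, unsigned long mount_opt)
{
	if (sketch_match(opt) == SKETCH_OPT_DAX)
		mount_opt |= SKETCH_MOUNT_DAX;
	return mount_opt;
}
```

In this sketch, sketch_parse("xip", 0) sets the DAX bit, while the string "dax" is not recognised at all, which mirrors the match table in this particular patch.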
--- fs/ext2/ext2.h | 6 +++--- fs/ext2/file.c | 2 +- fs/ext2/inode.c | 6 +++--- fs/ext2/namei.c | 8 ++++---- fs/ext2/super.c | 16 ++++++++-------- 5 files changed, 19 insertions(+), 19 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index b8b1c11..0e1fe9d 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -381,9 +381,9 @@ struct ext2_inode { #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ #ifdef CONFIG_FS_DAX -#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ +#define EXT2_MOUNT_DAX 0x010000 /* Direct Access */ #else -#define EXT2_MOUNT_XIP 0 +#define EXT2_MOUNT_DAX 0 #endif #define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */ #define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */ @@ -789,7 +789,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync); extern const struct inode_operations ext2_file_inode_operations; extern const struct file_operations ext2_file_operations; -extern const struct file_operations ext2_xip_file_operations; +extern const struct file_operations ext2_dax_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index ae7f000..f9bcb9b 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = { }; #ifdef CONFIG_FS_DAX -const struct file_operations ext2_xip_file_operations = { +const struct file_operations ext2_dax_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, .write = do_sync_write, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 7ca76da..3776063 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1285,7 +1285,7 @@ void ext2_set_inode_flags(struct inode *inode) inode->i_flags |= S_NOATIME; if (flags & EXT2_DIRSYNC_FL) inode->i_flags |= S_DIRSYNC; - if (test_opt(inode->i_sb, XIP)) + if (test_opt(inode->i_sb, DAX)) inode->i_flags |= S_DAX; 
} @@ -1387,9 +1387,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_xip_file_operations; + inode->i_fop = &ext2_dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 0db888c..148f6e3 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_xip_file_operations; + inode->i_fop = &ext2_dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; @@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_xip_file_operations; + inode->i_fop = &ext2_dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; diff --git a/fs/ext2/super.c b/fs/ext2/super.c index fdcacf7..8062373 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -288,7 +288,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) #endif #ifdef CONFIG_FS_DAX - if (sbi->s_mount_opt & EXT2_MOUNT_XIP) + if (sbi->s_mount_opt & EXT2_MOUNT_DAX) seq_puts(seq, ",xip"); #endif @@ -393,7 +393,7 @@ enum { Opt_resgid, Opt_resuid, Opt_sb, 
Opt_err_cont, Opt_err_panic, Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr, - Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota, + Opt_acl, Opt_noacl, Opt_dax, Opt_ignore, Opt_err, Opt_quota, Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation }; @@ -421,7 +421,7 @@ static const match_table_t tokens = { {Opt_nouser_xattr, "nouser_xattr"}, {Opt_acl, "acl"}, {Opt_noacl, "noacl"}, - {Opt_xip, "xip"}, + {Opt_dax, "xip"}, {Opt_grpquota, "grpquota"}, {Opt_ignore, "noquota"}, {Opt_quota, "quota"}, @@ -548,9 +548,9 @@ static int parse_options(char *options, struct super_block *sb) "(no)acl options not supported"); break; #endif - case Opt_xip: + case Opt_dax: #ifdef CONFIG_FS_DAX - set_opt (sbi->s_mount_opt, XIP); + set_opt (sbi->s_mount_opt, DAX); #else ext2_msg(sb, KERN_INFO, "xip option not supported"); #endif @@ -896,7 +896,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); - if (sbi->s_mount_opt & EXT2_MOUNT_XIP) { + if (sbi->s_mount_opt & EXT2_MOUNT_DAX) { if (blocksize != PAGE_SIZE) { ext2_msg(sb, KERN_ERR, "error: unsupported blocksize for xip"); @@ -1275,10 +1275,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) ((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? 
MS_POSIXACL : 0); es = sbi->s_es; - if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) { ext2_msg(sb, KERN_WARNING, "warning: refusing change of " "xip flag with busy inodes while remounting"); - sbi->s_mount_opt ^= EXT2_MOUNT_XIP; + sbi->s_mount_opt ^= EXT2_MOUNT_DAX; } if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) { spin_unlock(&sbi->s_lock); -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752054AbaCWTKr (ORCPT ); Sun, 23 Mar 2014 15:10:47 -0400 Received: from mga09.intel.com ([134.134.136.24]:29851 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751629AbaCWTJG (ORCPT ); Sun, 23 Mar 2014 15:09:06 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505944117" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com, Ross Zwisler Subject: [PATCH v7 18/22] xip: Add xip_zero_page_range Date: Sun, 23 Mar 2014 15:08:44 -0400 Message-Id: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This new function allows us to support hole-punch for XIP files by zeroing a partial page, as opposed to the xip_truncate_page() function which can only truncate to the end of the page. Reimplement xip_truncate_page() as a macro that calls xip_zero_page_range(). 
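The arithmetic behind this can be sketched in userspace as follows. This is only an illustration under stated assumptions: a fixed 4096-byte page, a plain memory buffer in place of the address returned by the block device, and no get_block lookup or block-boundary check. The sketch_* names are hypothetical. The diff itself names the functions dax_zero_page_range() and dax_truncate_page().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Zero up to 'length' bytes starting at file offset 'from' within the
 * single page that contains 'from'.  The length is clamped so the
 * write never runs past the end of that page.  Returns the number of
 * bytes actually zeroed.
 */
static unsigned sketch_zero_page_range(unsigned char *page,
				       unsigned long long from,
				       unsigned length)
{
	unsigned offset = (unsigned)(from & (SKETCH_PAGE_SIZE - 1));
	unsigned room = SKETCH_PAGE_SIZE - offset;

	assert(page != NULL);
	if (length > room)
		length = room;
	memset(page + offset, 0, length);
	return length;
}

/*
 * Truncation is the special case "zero to the end of the page":
 * ask for a whole page and rely on the clamp above.
 */
static unsigned sketch_truncate_page(unsigned char *page,
				     unsigned long long from)
{
	return sketch_zero_page_range(page, from, SKETCH_PAGE_SIZE);
}
```

A hole-punch caller passes the exact byte count it wants cleared. A truncate caller passes a full page and lets the clamp cut it down, which is why the truncate helper can be a thin wrapper.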
Signed-off-by: Matthew Wilcox [ported to 3.13-rc2] Signed-off-by: Ross Zwisler --- Documentation/filesystems/dax.txt | 1 + fs/dax.c | 22 +++++++++++++++------- include/linux/fs.h | 9 ++++++++- 3 files changed, 24 insertions(+), 8 deletions(-) diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index 06f84e5..e5706cc 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -55,6 +55,7 @@ Filesystem support consists of for fault and page_mkwrite (which should probably call dax_fault() and dax_mkwrite(), passing the appropriate get_block() callback) - calling dax_truncate_page() instead of block_truncate_page() for DAX files +- calling dax_zero_page_range() instead of zero_user() for DAX files - ensuring that there is sufficient locking between reads, writes, truncates and page faults diff --git a/fs/dax.c b/fs/dax.c index 45a0a41..2d6b4bc 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -457,13 +457,16 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, EXPORT_SYMBOL_GPL(dax_mkwrite); /** - * dax_truncate_page - handle a partial page being truncated in a DAX file + * dax_zero_page_range - zero a range within a page of a DAX file * @inode: The file being truncated * @from: The file offset that is being truncated to + * @length: The number of bytes to zero * @get_block: The filesystem method used to translate file offsets to blocks * - * Similar to block_truncate_page(), this function can be called by a - * filesystem when it is truncating an DAX file to handle the partial page. + * This function can be called by a filesystem when it is zeroing part of a + * page in a DAX file. This is intended for hole-punch operations. If + * you are truncating a file, the helper function dax_truncate_page() may be + * more convenient. 
* * We work in terms of PAGE_CACHE_SIZE here for commonality with * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem @@ -471,12 +474,12 @@ EXPORT_SYMBOL_GPL(dax_mkwrite); * block size is smaller than PAGE_SIZE, we have to zero the rest of the page * since the file might be mmaped. */ -int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) +int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length, + get_block_t get_block) { struct buffer_head bh; pgoff_t index = from >> PAGE_CACHE_SHIFT; unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned length = PAGE_CACHE_ALIGN(from) - from; int err; /* Block boundary? Nothing to do */ @@ -491,11 +494,16 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) if (buffer_written(&bh)) { void *addr; err = dax_get_addr(inode, &bh, &addr); - if (err) + if (err < 0) return err; + /* + * ext4 sometimes asks to zero past the end of a block. It + * really just wants to zero to the end of the block. 
+ */ + length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset); memset(addr + offset, 0, length); } return 0; } -EXPORT_SYMBOL_GPL(dax_truncate_page); +EXPORT_SYMBOL_GPL(dax_zero_page_range); diff --git a/include/linux/fs.h b/include/linux/fs.h index bff394d..d0381ab 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2521,6 +2521,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_DAX int dax_clear_blocks(struct inode *, sector_t block, long size); +int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags); @@ -2532,7 +2533,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz) return 0; } -static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb) +static inline int dax_zero_page_range(struct inode *inode, loff_t from, + unsigned len, get_block_t gb) { return 0; } @@ -2545,6 +2547,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, } #endif +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */ +#define dax_truncate_page(inode, from, get_block) \ + dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block) + + #ifdef CONFIG_BLOCK typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode, loff_t file_offset); -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751954AbaCWTKn (ORCPT ); Sun, 23 Mar 2014 15:10:43 -0400 Received: from mga09.intel.com ([134.134.136.24]:29849 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751636AbaCWTJH (ORCPT ); Sun, 23 Mar 2014 15:09:07 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; 
d="scan'208";a="478413340" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Ross Zwisler , willy@linux.intel.com, Matthew Wilcox Subject: [PATCH v7 20/22] ext4: Add DAX functionality Date: Sun, 23 Mar 2014 15:08:46 -0400 Message-Id: <490bf3041f0e0633964ca84bf4fb0bb3dd999694.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Ross Zwisler This is a port of the DAX functionality found in the current version of ext2. Signed-off-by: Ross Zwisler Reviewed-by: Andreas Dilger [heavily tweaked] Signed-off-by: Matthew Wilcox --- Documentation/filesystems/dax.txt | 1 + Documentation/filesystems/ext4.txt | 2 ++ fs/ext4/ext4.h | 6 +++++ fs/ext4/file.c | 53 +++++++++++++++++++++++++++++++++----- fs/ext4/indirect.c | 19 +++++++++----- fs/ext4/inode.c | 52 ++++++++++++++++++++++++------------- fs/ext4/namei.c | 10 +++++-- fs/ext4/super.c | 39 +++++++++++++++++++++++++++- 8 files changed, 149 insertions(+), 33 deletions(-) diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index e5706cc..619dab5 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -66,6 +66,7 @@ or a write()) work correctly. These filesystems may be used for inspiration: - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt +- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt Shortcomings diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 919a329..9c511c4 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any i_version Enable 64-bit inode version support. This option is off by default. 
+dax Use direct access if possible + Data Mode ========= There are 3 different data modes: diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index e025c29..00e9b79 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -966,6 +966,11 @@ struct ext4_inode_info { #define EXT4_MOUNT_ERRORS_MASK 0x00070 #define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */ #define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/ +#ifdef CONFIG_FS_DAX +#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */ +#else +#define EXT4_MOUNT_DAX 0 +#endif #define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */ #define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */ #define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */ @@ -2581,6 +2586,7 @@ extern const struct file_operations ext4_dir_operations; /* file.c */ extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; +extern const struct file_operations ext4_dax_file_operations; extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); extern void ext4_unwritten_wait(struct inode *inode); diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 1a50739..42a8ccd 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, } } - if (unlikely(iocb->ki_filp->f_flags & O_DIRECT)) + if (io_is_direct(iocb->ki_filp)) ret = ext4_file_dio_write(iocb, iov, nr_segs, pos); else ret = generic_file_aio_write(iocb, iov, nr_segs, pos); @@ -198,6 +198,27 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, return ret; } +#ifdef CONFIG_FS_DAX +static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_fault(vma, vmf, ext4_get_block); + /* Is this the right get_block? 
*/ +} + +static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_mkwrite(vma, vmf, ext4_get_block); +} + +static const struct vm_operations_struct ext4_dax_vm_ops = { + .fault = ext4_dax_fault, + .page_mkwrite = ext4_dax_mkwrite, + .remap_pages = generic_file_remap_pages, +}; +#else +#define ext4_dax_vm_ops ext4_file_vm_ops +#endif + static const struct vm_operations_struct ext4_file_vm_ops = { .fault = filemap_fault, .page_mkwrite = ext4_page_mkwrite, @@ -206,12 +227,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = { static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) { - struct address_space *mapping = file->f_mapping; - - if (!mapping->a_ops->readpage) - return -ENOEXEC; file_accessed(file); - vma->vm_ops = &ext4_file_vm_ops; + if (IS_DAX(file_inode(file))) { + vma->vm_ops = &ext4_dax_vm_ops; + vma->vm_flags |= VM_MIXEDMAP; + } else { + vma->vm_ops = &ext4_file_vm_ops; + } return 0; } @@ -609,6 +631,25 @@ const struct file_operations ext4_file_operations = { .fallocate = ext4_fallocate, }; +#ifdef CONFIG_FS_DAX +const struct file_operations ext4_dax_file_operations = { + .llseek = ext4_llseek, + .read = do_sync_read, + .write = do_sync_write, + .aio_read = generic_file_aio_read, + .aio_write = ext4_file_write, + .unlocked_ioctl = ext4_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ext4_compat_ioctl, +#endif + .mmap = ext4_file_mmap, + .open = ext4_file_open, + .release = ext4_release_file, + .fsync = ext4_sync_file, + .fallocate = ext4_fallocate, +}; +#endif + const struct inode_operations ext4_file_inode_operations = { .setattr = ext4_setattr, .getattr = ext4_getattr, diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 594009f..5fdb414 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -686,15 +686,22 @@ retry: inode_dio_done(inode); goto locked; } - ret = __blockdev_direct_IO(rw, iocb, inode, - inode->i_sb->s_bdev, iov, - offset, nr_segs, - ext4_get_block, NULL, NULL, 
0); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, + ext4_get_block, NULL, 0); + else + ret = __blockdev_direct_IO(rw, iocb, inode, + inode->i_sb->s_bdev, iov, offset, + nr_segs, ext4_get_block, NULL, NULL, 0); inode_dio_done(inode); } else { locked: - ret = blockdev_direct_IO(rw, iocb, inode, iov, - offset, nr_segs, ext4_get_block); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, + ext4_get_block, NULL, DIO_LOCKING); + else + ret = blockdev_direct_IO(rw, iocb, inode, iov, + offset, nr_segs, ext4_get_block); if (unlikely((rw & WRITE) && ret < 0)) { loff_t isize = i_size_read(inode); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index ce7341c..9462730 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3140,13 +3140,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb, get_block_func = ext4_get_block_write; dio_flags = DIO_LOCKING; } - ret = __blockdev_direct_IO(rw, iocb, inode, - inode->i_sb->s_bdev, iov, - offset, nr_segs, - get_block_func, - ext4_end_io_dio, - NULL, - dio_flags); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, + get_block_func, ext4_end_io_dio, dio_flags); + else + ret = __blockdev_direct_IO(rw, iocb, inode, + inode->i_sb->s_bdev, iov, offset, + nr_segs, get_block_func, + ext4_end_io_dio, NULL, dio_flags); /* * Put our reference to io_end. This can free the io_end structure e.g. @@ -3311,14 +3312,7 @@ void ext4_set_aops(struct inode *inode) inode->i_mapping->a_ops = &ext4_aops; } -/* - * ext4_block_zero_page_range() zeros out a mapping of length 'length' - * starting from file offset 'from'. The range to be zero'd must - * be contained with in one block. 
If the specified range exceeds - * the end of the block it will be shortened to end of the block - * that cooresponds to 'from' - */ -static int ext4_block_zero_page_range(handle_t *handle, +static int __ext4_block_zero_page_range(handle_t *handle, struct address_space *mapping, loff_t from, loff_t length) { ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT; @@ -3409,6 +3403,22 @@ unlock: } /* + * ext4_block_zero_page_range() zeros out a mapping of length 'length' + * starting from file offset 'from'. The range to be zero'd must + * be contained with in one block. If the specified range exceeds + * the end of the block it will be shortened to end of the block + * that cooresponds to 'from' + */ +static int ext4_block_zero_page_range(handle_t *handle, + struct address_space *mapping, loff_t from, loff_t length) +{ + struct inode *inode = mapping->host; + if (IS_DAX(inode)) + return dax_zero_page_range(inode, from, length, ext4_get_block); + return __ext4_block_zero_page_range(handle, mapping, from, length); +} + +/* * ext4_block_truncate_page() zeroes out a mapping from file offset `from' * up to the end of the block which corresponds to `from'. * This required during truncate. 
We need to physically zero the tail end @@ -3922,7 +3932,8 @@ void ext4_set_inode_flags(struct inode *inode) { unsigned int flags = EXT4_I(inode)->i_flags; - inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); + inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | + S_DIRSYNC | S_DAX); if (flags & EXT4_SYNC_FL) inode->i_flags |= S_SYNC; if (flags & EXT4_APPEND_FL) @@ -3933,6 +3944,8 @@ void ext4_set_inode_flags(struct inode *inode) inode->i_flags |= S_NOATIME; if (flags & EXT4_DIRSYNC_FL) inode->i_flags |= S_DIRSYNC; + if (test_opt(inode->i_sb, DAX)) + inode->i_flags |= S_DAX; } /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */ @@ -4184,7 +4197,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); } else if (S_ISDIR(inode->i_mode)) { inode->i_op = &ext4_dir_inode_operations; @@ -4640,7 +4656,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) * Truncate pagecache after we've waited for commit * in data=journal mode to make pages freeable. 
*/ - truncate_pagecache(inode, inode->i_size); + truncate_pagecache(inode, inode->i_size); } /* * We want to call ext4_truncate() even if attr->ia_size == diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index d050e04..acb9cca 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2249,7 +2249,10 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); err = ext4_add_nondir(handle, dentry, inode); if (!err && IS_DIRSYNC(dir)) @@ -2313,7 +2316,10 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); d_tmpfile(dentry, inode); err = ext4_orphan_add(handle, inode); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 710fed2..c0b7f4c 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1156,7 +1156,7 @@ enum { Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota, Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err, - Opt_usrquota, Opt_grpquota, Opt_i_version, + Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax, Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit, Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity, Opt_inode_readahead_blks, Opt_journal_ioprio, @@ -1218,6 +1218,7 @@ static const match_table_t tokens = { {Opt_barrier, "barrier"}, {Opt_nobarrier, "nobarrier"}, {Opt_i_version, "i_version"}, + {Opt_dax, "dax"}, {Opt_stripe, "stripe=%u"}, {Opt_delalloc, "delalloc"}, {Opt_nodelalloc, "nodelalloc"}, @@ -1400,6 +1401,7 @@ static const struct mount_opts { {Opt_min_batch_time, 0, MOPT_GTE0}, {Opt_inode_readahead_blks, 0, 
MOPT_GTE0}, {Opt_init_itable, 0, MOPT_GTE0}, + {Opt_dax, EXT4_MOUNT_DAX, MOPT_SET}, {Opt_stripe, 0, MOPT_GTE0}, {Opt_resuid, 0, MOPT_GTE0}, {Opt_resgid, 0, MOPT_GTE0}, @@ -1638,6 +1640,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token, } sbi->s_jquota_fmt = m->mount_opt; #endif +#ifndef CONFIG_FS_DAX + } else if (token == Opt_dax) { + ext4_msg(sb, KERN_INFO, "dax option not supported"); + return -1; +#endif } else { if (!args->from) arg = 1; @@ -3560,6 +3567,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) "both data=journal and dioread_nolock"); goto failed_mount; } + if (test_opt(sb, DAX)) { + ext4_msg(sb, KERN_ERR, "can't mount with " + "both data=journal and dax"); + goto failed_mount; + } if (test_opt(sb, DELALLOC)) clear_opt(sb, DELALLOC); } @@ -3613,6 +3625,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) goto failed_mount; } + if (sbi->s_mount_opt & EXT4_MOUNT_DAX) { + if (blocksize != PAGE_SIZE) { + ext4_msg(sb, KERN_ERR, + "error: unsupported blocksize for dax"); + goto failed_mount; + } + if (!sb->s_bdev->bd_disk->fops->direct_access) { + ext4_msg(sb, KERN_ERR, + "error: device does not support dax"); + goto failed_mount; + } + } + if (sb->s_blocksize != blocksize) { /* Validate the filesystem blocksize */ if (!sb_set_blocksize(sb, blocksize)) { @@ -4813,6 +4838,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) err = -EINVAL; goto restore_opts; } + if (test_opt(sb, DAX)) { + ext4_msg(sb, KERN_ERR, "can't mount with " + "both data=journal and dax"); + err = -EINVAL; + goto restore_opts; + } + } + + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) { + ext4_msg(sb, KERN_WARNING, "warning: refusing change of " + "dax flag with busy inodes while remounting"); + sbi->s_mount_opt ^= EXT4_MOUNT_DAX; } if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED) -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752225AbaCWTMS (ORCPT ); Sun, 23 Mar 2014 15:12:18 -0400 Received: from mga01.intel.com ([192.55.52.88]:9372 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751579AbaCWTJF (ORCPT ); Sun, 23 Mar 2014 15:09:05 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505021435" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 21/22] ext4: Fix typos Date: Sun, 23 Mar 2014 15:08:47 -0400 Message-Id: <2b2c5467283817503fede11d12cba8aef912c9c5.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Comment fix only Signed-off-by: Matthew Wilcox --- fs/ext4/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9462730..14a9744 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3691,7 +3691,7 @@ void ext4_truncate(struct inode *inode) /* * There is a possibility that we're either freeing the inode - * or it completely new indode. In those cases we might not + * or it's a completely new inode. In those cases we might not * have i_mutex locked because it's not necessary. 
*/ if (!(inode->i_state & (I_NEW|I_FREEING))) -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752304AbaCWTM6 (ORCPT ); Sun, 23 Mar 2014 15:12:58 -0400 Received: from mga09.intel.com ([134.134.136.24]:29851 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751561AbaCWTJE (ORCPT ); Sun, 23 Mar 2014 15:09:04 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505944113" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 16/22] ext2: Remove ext2_aops_xip Date: Sun, 23 Mar 2014 15:08:42 -0400 Message-Id: <0b6512aa46a504459f41d3c609fc20c93d4a911a.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org We shouldn't need a special address_space_operations any more Signed-off-by: Matthew Wilcox --- fs/ext2/ext2.h | 1 - fs/ext2/inode.c | 7 +------ fs/ext2/namei.c | 4 ++-- 3 files changed, 3 insertions(+), 9 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index b30c3bd..b8b1c11 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; -extern const struct address_space_operations ext2_aops_xip; extern const struct address_space_operations ext2_nobh_aops; /* namei.c */ diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 67124f0..7ca76da 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = { .error_remove_page = generic_error_remove_page, }; -const struct address_space_operations ext2_aops_xip = { - .bmap = ext2_bmap, - .direct_IO 
= ext2_direct_IO, -}; - const struct address_space_operations ext2_nobh_aops = { .readpage = ext2_readpage, .readpages = ext2_readpages, @@ -1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext2_file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = &ext2_aops_xip; + inode->i_mapping->a_ops = &ext2_aops; inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 7ca803f..0db888c 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode inode->i_op = &ext2_file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = &ext2_aops_xip; + inode->i_mapping->a_ops = &ext2_aops; inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; @@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) inode->i_op = &ext2_file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = &ext2_aops_xip; + inode->i_mapping->a_ops = &ext2_aops; inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752416AbaCWTND (ORCPT ); Sun, 23 Mar 2014 15:13:03 -0400 Received: from mga09.intel.com ([134.134.136.24]:29846 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751268AbaCWTJE (ORCPT ); Sun, 23 Mar 2014 15:09:04 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505944106" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, 
linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 13/22] ext2: Remove ext2_use_xip Date: Sun, 23 Mar 2014 15:08:39 -0400 Message-Id: <0c65dcd599646e3054d0c524a0c5b25b07885763.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Replace ext2_use_xip() with test_opt(XIP) which expands to the same code Signed-off-by: Matthew Wilcox --- fs/ext2/ext2.h | 4 ++++ fs/ext2/inode.c | 2 +- fs/ext2/namei.c | 4 ++-- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index d9a17d0..5ecf570 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -380,7 +380,11 @@ struct ext2_inode { #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ +#ifdef CONFIG_FS_XIP #define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ +#else +#define EXT2_MOUNT_XIP 0 +#endif #define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */ #define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */ #define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */ diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index a9346a9..2e587e2 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext2_file_inode_operations; - if (ext2_use_xip(inode->i_sb)) { + if (test_opt(inode->i_sb, XIP)) { inode->i_mapping->a_ops = &ext2_aops_xip; inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index c268d0a..846c356 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * 
dentry, umode_t mode return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (ext2_use_xip(inode->i_sb)) { + if (test_opt(inode->i_sb, XIP)) { inode->i_mapping->a_ops = &ext2_aops_xip; inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { @@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (ext2_use_xip(inode->i_sb)) { + if (test_opt(inode->i_sb, XIP)) { inode->i_mapping->a_ops = &ext2_aops_xip; inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752378AbaCWTNB (ORCPT ); Sun, 23 Mar 2014 15:13:01 -0400 Received: from mga09.intel.com ([134.134.136.24]:29849 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751553AbaCWTJE (ORCPT ); Sun, 23 Mar 2014 15:09:04 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="478413324" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page Date: Sun, 23 Mar 2014 15:08:34 -0400 Message-Id: X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org It takes a get_block parameter just like nobh_truncate_page() and block_truncate_page() Signed-off-by: Matthew Wilcox --- fs/dax.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++---- fs/ext2/inode.c | 2 +- include/linux/fs.h | 4 ++-- mm/filemap_xip.c | 40 ---------------------------------------- 4 files changed, 51 insertions(+), 47 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 863749c..7271be0 100644 --- 
a/fs/dax.c +++ b/fs/dax.c @@ -374,13 +374,13 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, } /** - * dax_fault - handle a page fault on an XIP file + * dax_fault - handle a page fault on a DAX file * @vma: The virtual memory area where the fault occurred * @vmf: The description of the fault * @get_block: The filesystem method used to translate file offsets to blocks * * When a page fault occurs, filesystems may call this helper in their - * fault handler for XIP files. + * fault handler for DAX files. */ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, get_block_t get_block) @@ -398,12 +398,12 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, EXPORT_SYMBOL_GPL(dax_fault); /** - * dax_mkwrite - convert a read-only page to read-write in an XIP file + * dax_mkwrite - convert a read-only page to read-write in a DAX file * @vma: The virtual memory area where the fault occurred * @vmf: The description of the fault * @get_block: The filesystem method used to translate file offsets to blocks * - * XIP handles reads of holes by adding pages full of zeroes into the + * DAX handles reads of holes by adding pages full of zeroes into the * mapping. If the page is subsequenty written to, we have to allocate * the page on media and free the page that was in the cache. */ @@ -421,3 +421,47 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, return result; } EXPORT_SYMBOL_GPL(dax_mkwrite); + +/** + * dax_truncate_page - handle a partial page being truncated in a DAX file + * @inode: The file being truncated + * @from: The file offset that is being truncated to + * @get_block: The filesystem method used to translate file offsets to blocks + * + * Similar to block_truncate_page(), this function can be called by a + * filesystem when it is truncating a DAX file to handle the partial page.
+ * + * We work in terms of PAGE_CACHE_SIZE here for commonality with + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem + * took care of disposing of the unnecessary blocks. Even if the filesystem + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page + * since the file might be mmaped. + */ +int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) +{ + struct buffer_head bh; + pgoff_t index = from >> PAGE_CACHE_SHIFT; + unsigned offset = from & (PAGE_CACHE_SIZE-1); + unsigned length = PAGE_CACHE_ALIGN(from) - from; + int err; + + /* Block boundary? Nothing to do */ + if (!length) + return 0; + + memset(&bh, 0, sizeof(bh)); + bh.b_size = PAGE_CACHE_SIZE; + err = get_block(inode, index, &bh, 0); + if (err < 0) + return err; + if (buffer_written(&bh)) { + void *addr; + err = dax_get_addr(inode, &bh, &addr); + if (err) + return err; + memset(addr + offset, 0, length); + } + + return 0; +} +EXPORT_SYMBOL_GPL(dax_truncate_page); diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index f128ebf..252481f 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1207,7 +1207,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) inode_dio_wait(inode); if (IS_DAX(inode)) - error = xip_truncate_page(inode->i_mapping, newsize); + error = dax_truncate_page(inode, newsize, ext2_get_block); else if (test_opt(inode->i_sb, NOBH)) error = nobh_truncate_page(inode->i_mapping, newsize, ext2_get_block); diff --git a/include/linux/fs.h b/include/linux/fs.h index 1607812..9752ae5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2522,13 +2522,13 @@ extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_XIP -extern int xip_truncate_page(struct address_space *mapping, loff_t from); +int dax_truncate_page(struct inode *, loff_t from, get_block_t); ssize_t dax_do_io(int rw, struct kiocb *, struct 
inode *, const struct iovec *, loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t); #else -static inline int xip_truncate_page(struct address_space *mapping, loff_t from) +static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb) { return 0; } diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 9dd45f3..6316578 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -21,43 +21,3 @@ #include #include -/* - * truncate a page used for execute in place - * functionality is analog to block_truncate_page but does use get_xip_mem - * to get the page instead of page cache - */ -int -xip_truncate_page(struct address_space *mapping, loff_t from) -{ - pgoff_t index = from >> PAGE_CACHE_SHIFT; - unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned blocksize; - unsigned length; - void *xip_mem; - unsigned long xip_pfn; - int err; - - BUG_ON(!mapping->a_ops->get_xip_mem); - - blocksize = 1 << mapping->host->i_blkbits; - length = offset & (blocksize - 1); - - /* Block boundary? Nothing to do */ - if (!length) - return 0; - - length = blocksize - length; - - err = mapping->a_ops->get_xip_mem(mapping, index, 0, - &xip_mem, &xip_pfn); - if (unlikely(err)) { - if (err == -ENODATA) - /* Hole? 
No need to truncate */ - return 0; - else - return err; - } - memset(xip_mem + offset, 0, length); - return 0; -} -EXPORT_SYMBOL_GPL(xip_truncate_page); -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752646AbaCWTOW (ORCPT ); Sun, 23 Mar 2014 15:14:22 -0400 Received: from mga01.intel.com ([192.55.52.88]:9372 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751539AbaCWTJD (ORCPT ); Sun, 23 Mar 2014 15:09:03 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="497904585" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 14/22] ext2: Remove xip.c and xip.h Date: Sun, 23 Mar 2014 15:08:40 -0400 Message-Id: <33ff0862f6d99b352429ef4494817544c3d5da68.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org These files are now empty, so delete them Signed-off-by: Matthew Wilcox --- fs/ext2/Makefile | 1 - fs/ext2/inode.c | 1 - fs/ext2/namei.c | 1 - fs/ext2/super.c | 1 - fs/ext2/xip.c | 15 --------------- fs/ext2/xip.h | 16 ---------------- 6 files changed, 35 deletions(-) delete mode 100644 fs/ext2/xip.c delete mode 100644 fs/ext2/xip.h diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile index f42af45..445b0e9 100644 --- a/fs/ext2/Makefile +++ b/fs/ext2/Makefile @@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \ ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o -ext2-$(CONFIG_EXT2_FS_XIP) += xip.o diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 2e587e2..67124f0 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ 
-34,7 +34,6 @@ #include #include "ext2.h" #include "acl.h" -#include "xip.h" #include "xattr.h" static int __ext2_write_inode(struct inode *inode, int do_sync); diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 846c356..7ca803f 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -35,7 +35,6 @@ #include "ext2.h" #include "xattr.h" #include "acl.h" -#include "xip.h" static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode) { diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 3a1db39..752ccb4 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -35,7 +35,6 @@ #include "ext2.h" #include "xattr.h" #include "acl.h" -#include "xip.h" static void ext2_sync_super(struct super_block *sb, struct ext2_super_block *es, int wait); diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c deleted file mode 100644 index 66ca113..0000000 --- a/fs/ext2/xip.c +++ /dev/null @@ -1,15 +0,0 @@ -/* - * linux/fs/ext2/xip.c - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte (cotte@de.ibm.com) - */ - -#include -#include -#include -#include -#include -#include "ext2.h" -#include "xip.h" - diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h deleted file mode 100644 index 87eeb04..0000000 --- a/fs/ext2/xip.h +++ /dev/null @@ -1,16 +0,0 @@ -/* - * linux/fs/ext2/xip.h - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte (cotte@de.ibm.com) - */ - -#ifdef CONFIG_EXT2_FS_XIP -static inline int ext2_use_xip (struct super_block *sb) -{ - struct ext2_sb_info *sbi = EXT2_SB(sb); - return (sbi->s_mount_opt & EXT2_MOUNT_XIP); -} -#else -#define ext2_use_xip(sb) 0 -#endif -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752661AbaCWTOY (ORCPT ); Sun, 23 Mar 2014 15:14:24 -0400 Received: from mga01.intel.com ([192.55.52.88]:5159 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751543AbaCWTJD (ORCPT ); Sun, 23 Mar 2014 15:09:03 -0400 X-ExtLoop1: 1 
X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="497904588" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Date: Sun, 23 Mar 2014 15:08:41 -0400 Message-Id: X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The fewer Kconfig options we have the better. Use the generic CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core. Signed-off-by: Matthew Wilcox --- fs/Kconfig | 21 ++++++++++++++------- fs/Makefile | 2 +- fs/ext2/Kconfig | 11 ----------- fs/ext2/ext2.h | 2 +- fs/ext2/file.c | 4 ++-- fs/ext2/super.c | 4 ++-- include/linux/fs.h | 4 ++-- 7 files changed, 22 insertions(+), 26 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 7385e54..620ab73 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -13,13 +13,6 @@ if BLOCK source "fs/ext2/Kconfig" source "fs/ext3/Kconfig" source "fs/ext4/Kconfig" - -config FS_XIP -# execute in place - bool - depends on EXT2_FS_XIP - default y - source "fs/jbd/Kconfig" source "fs/jbd2/Kconfig" @@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig" source "fs/btrfs/Kconfig" source "fs/nilfs2/Kconfig" +config FS_DAX + bool "Direct Access support" + depends on MMU + help + Direct Access (DAX) can be used on memory-backed block devices. + If the block device supports DAX and the filesystem supports DAX, + then you can avoid using the pagecache to buffer I/Os. Turning + on this option will compile in support for DAX; you will need to + mount the filesystem using the -o xip option. + + If you do not have a block device that is capable of using this, + or if unsure, say N. Saying Y will increase the size of the kernel + by about 2kB. 
+ endif # BLOCK # Posix ACL utility routines diff --git a/fs/Makefile b/fs/Makefile index 2f194cd..b7e0a13 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_AIO) += aio.o -obj-$(CONFIG_FS_XIP) += dax.o +obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig index 14a6780..c634874e 100644 --- a/fs/ext2/Kconfig +++ b/fs/ext2/Kconfig @@ -42,14 +42,3 @@ config EXT2_FS_SECURITY If you are not using a security module that requires using extended attributes for file security labels, say N. - -config EXT2_FS_XIP - bool "Ext2 execute in place support" - depends on EXT2_FS && MMU - help - Execute in place can be used on memory-backed block devices. If you - enable this option, you can select to mount block devices which are - capable of this feature without using the page cache. - - If you do not use a block device that is capable of using this, - or if unsure, say N. 
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 5ecf570..b30c3bd 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -380,7 +380,7 @@ struct ext2_inode { #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ -#ifdef CONFIG_FS_XIP +#ifdef CONFIG_FS_DAX #define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ #else #define EXT2_MOUNT_XIP 0 diff --git a/fs/ext2/file.c b/fs/ext2/file.c index e3ce10d..ae7f000 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -25,7 +25,7 @@ #include "xattr.h" #include "acl.h" -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { return dax_fault(vma, vmf, ext2_get_block); @@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = { .splice_write = generic_file_splice_write, }; -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX const struct file_operations ext2_xip_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 752ccb4..fdcacf7 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) seq_puts(seq, ",grpquota"); #endif -#if defined(CONFIG_EXT2_FS_XIP) +#ifdef CONFIG_FS_DAX if (sbi->s_mount_opt & EXT2_MOUNT_XIP) seq_puts(seq, ",xip"); #endif @@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb) break; #endif case Opt_xip: -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX set_opt (sbi->s_mount_opt, XIP); #else ext2_msg(sb, KERN_INFO, "xip option not supported"); diff --git a/include/linux/fs.h b/include/linux/fs.h index aeab3fda..bff394d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1681,7 +1681,7 @@ struct super_operations { #define IS_IMA(inode) ((inode)->i_flags & S_IMA) #define IS_AUTOMOUNT(inode) 
((inode)->i_flags & S_AUTOMOUNT) #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) -#ifdef CONFIG_FS_XIP +#ifdef CONFIG_FS_DAX #define IS_DAX(inode) ((inode)->i_flags & S_DAX) #else #define IS_DAX(inode) 0 @@ -2519,7 +2519,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset, extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); -#ifdef CONFIG_FS_XIP +#ifdef CONFIG_FS_DAX int dax_clear_blocks(struct inode *, sector_t block, long size); int dax_truncate_page(struct inode *, loff_t from, get_block_t); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752613AbaCWTOU (ORCPT ); Sun, 23 Mar 2014 15:14:20 -0400 Received: from mga01.intel.com ([192.55.52.88]:64546 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751547AbaCWTJD (ORCPT ); Sun, 23 Mar 2014 15:09:03 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505021433" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 19/22] ext4: Make ext4_block_zero_page_range static Date: Sun, 23 Mar 2014 15:08:45 -0400 Message-Id: <6ae0bcd05c2e114d3c4a7803415b6c2c8a8dadd7.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org It's only called within inode.c, so make it static, remove its prototype from ext4.h and move it above all of its callers so it doesn't need a prototype within inode.c. 
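As a rough illustration of the reordering described above (not the ext4 code itself), the userspace sketch below defines a static helper ahead of its only caller, so no forward prototype is needed in either the header or the .c file. The names zero_range_length() and truncate_tail_length() are hypothetical; the arithmetic only loosely mirrors the length calculation in ext4_block_truncate_page() and the "shortened to end of the block" behaviour noted in the comment on ext4_block_zero_page_range().

```c
#include <assert.h>

/*
 * Hypothetical userspace sketch: the names and signatures here are
 * assumptions for illustration, not kernel interfaces.
 */

/*
 * Internal helper. 'static' limits it to this file, and because it is
 * defined before its caller the compiler has already seen it, so no
 * separate prototype is required.
 * Clamps 'length' so the range starting at 'from' stays in one block.
 */
static unsigned zero_range_length(unsigned long long from, unsigned length,
				  unsigned blocksize)
{
	unsigned offset, max;

	/* Block size is assumed to be a non-zero power of two. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

	offset = (unsigned)(from & (blocksize - 1));
	max = blocksize - offset;	/* bytes left in this block */

	return length > max ? max : length;
}

/*
 * Externally visible entry point. It is the helper's only caller, so
 * it sits below the helper in the file.
 * Returns the number of bytes from 'from' to the end of its block.
 */
unsigned truncate_tail_length(unsigned long long from, unsigned blocksize)
{
	unsigned offset = (unsigned)(from & (blocksize - 1));

	return zero_range_length(from, blocksize - offset, blocksize);
}
```

With this layout only truncate_tail_length() would need a declaration in a shared header, which is the same effect the patch gets by dropping the ext4_block_zero_page_range() prototype from ext4.h.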
Signed-off-by: Matthew Wilcox --- fs/ext4/ext4.h | 2 -- fs/ext4/inode.c | 42 +++++++++++++++++++++--------------------- 2 files changed, 21 insertions(+), 23 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index d3a534f..e025c29 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2133,8 +2133,6 @@ extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); extern int ext4_block_truncate_page(handle_t *handle, struct address_space *mapping, loff_t from); -extern int ext4_block_zero_page_range(handle_t *handle, - struct address_space *mapping, loff_t from, loff_t length); extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode, loff_t lstart, loff_t lend); extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 6e39895..ce7341c 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3312,33 +3312,13 @@ void ext4_set_aops(struct inode *inode) } /* - * ext4_block_truncate_page() zeroes out a mapping from file offset `from' - * up to the end of the block which corresponds to `from'. - * This required during truncate. We need to physically zero the tail end - * of that block so it doesn't yield old data if the file is later grown. - */ -int ext4_block_truncate_page(handle_t *handle, - struct address_space *mapping, loff_t from) -{ - unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned length; - unsigned blocksize; - struct inode *inode = mapping->host; - - blocksize = inode->i_sb->s_blocksize; - length = blocksize - (offset & (blocksize - 1)); - - return ext4_block_zero_page_range(handle, mapping, from, length); -} - -/* * ext4_block_zero_page_range() zeros out a mapping of length 'length' * starting from file offset 'from'. The range to be zero'd must * be contained with in one block. 
If the specified range exceeds * the end of the block it will be shortened to end of the block * that cooresponds to 'from' */ -int ext4_block_zero_page_range(handle_t *handle, +static int ext4_block_zero_page_range(handle_t *handle, struct address_space *mapping, loff_t from, loff_t length) { ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT; @@ -3428,6 +3408,26 @@ unlock: return err; } +/* + * ext4_block_truncate_page() zeroes out a mapping from file offset `from' + * up to the end of the block which corresponds to `from'. + * This required during truncate. We need to physically zero the tail end + * of that block so it doesn't yield old data if the file is later grown. + */ +int ext4_block_truncate_page(handle_t *handle, + struct address_space *mapping, loff_t from) +{ + unsigned offset = from & (PAGE_CACHE_SIZE-1); + unsigned length; + unsigned blocksize; + struct inode *inode = mapping->host; + + blocksize = inode->i_sb->s_blocksize; + length = blocksize - (offset & (blocksize - 1)); + + return ext4_block_zero_page_range(handle, mapping, from, length); +} + int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode, loff_t lstart, loff_t length) { -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752842AbaCWTQf (ORCPT ); Sun, 23 Mar 2014 15:16:35 -0400 Received: from mga01.intel.com ([192.55.52.88]:9372 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751448AbaCWTJB (ORCPT ); Sun, 23 Mar 2014 15:09:01 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505021423" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Date: Sun, 23 Mar 2014 15:08:33 -0400 Message-Id: X-Mailer: git-send-email 1.9.0 In-Reply-To: References: 
In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Instead of calling aops->get_xip_mem from the fault handler, the filesystem passes a get_block_t that is used to find the appropriate blocks. Signed-off-by: Matthew Wilcox --- fs/dax.c | 207 +++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext2/file.c | 35 ++++++++- include/linux/fs.h | 4 +- mm/filemap_xip.c | 206 ---------------------------------------------------- 4 files changed, 243 insertions(+), 209 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 66a6bda..863749c 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -19,8 +19,12 @@ #include #include #include +#include +#include +#include #include #include +#include static long dax_get_addr(struct inode *inode, struct buffer_head *bh, void **addr) @@ -32,6 +36,16 @@ static long dax_get_addr(struct inode *inode, struct buffer_head *bh, return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size); } +static long dax_get_pfn(struct inode *inode, struct buffer_head *bh, + unsigned long *pfn) +{ + struct block_device *bdev = bh->b_bdev; + const struct block_device_operations *ops = bdev->bd_disk->fops; + void *addr; + sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); + return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size); +} + static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t offset, loff_t end, int rw) { @@ -214,3 +228,196 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, return retval; } EXPORT_SYMBOL_GPL(dax_do_io); + +/* + * The user has performed a load from a hole in the file. Allocating + * a new page in the file would cause excessive storage usage for + * workloads with sparse files. We allocate a page cache page instead. + * We'll kick it out of the page cache if it's ever written to, + * otherwise it will simply fall out of the page cache under memory + * pressure without ever having been dirtied. 
+ */ +static int dax_load_hole(struct address_space *mapping, struct page *page, + struct vm_fault *vmf) +{ + unsigned long size; + struct inode *inode = mapping->host; + if (!page) + page = find_or_create_page(mapping, vmf->pgoff, + GFP_KERNEL | __GFP_ZERO); + if (!page) + return VM_FAULT_OOM; + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; + if (vmf->pgoff >= size) { + unlock_page(page); + page_cache_release(page); + return VM_FAULT_SIGBUS; + } + + vmf->page = page; + return VM_FAULT_LOCKED; +} + +static void copy_user_bh(struct page *to, struct inode *inode, + struct buffer_head *bh, unsigned long vaddr) +{ + void *vfrom, *vto; + dax_get_addr(inode, bh, &vfrom); /* XXX: error handling */ + vto = kmap_atomic(to); + copy_user_page(vto, vfrom, vaddr, to); + kunmap_atomic(vto); +} + +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, + get_block_t get_block) +{ + struct file *file = vma->vm_file; + struct inode *inode = file_inode(file); + struct address_space *mapping = file->f_mapping; + struct page *page; + struct buffer_head bh; + unsigned long vaddr = (unsigned long)vmf->virtual_address; + sector_t block; + pgoff_t size; + unsigned long pfn; + int error; + int major = 0; + + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; + if (vmf->pgoff >= size) + return VM_FAULT_SIGBUS; + + memset(&bh, 0, sizeof(bh)); + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits); + bh.b_size = PAGE_SIZE; + + repeat: + page = find_get_page(mapping, vmf->pgoff); + if (page) { + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { + page_cache_release(page); + return VM_FAULT_RETRY; + } + if (unlikely(page->mapping != mapping)) { + unlock_page(page); + page_cache_release(page); + goto repeat; + } + } + + error = get_block(inode, block, &bh, 0); + if (error || bh.b_size < PAGE_SIZE) + goto sigbus; + + if (!buffer_written(&bh) && !vmf->cow_page) { + if (vmf->flags & FAULT_FLAG_WRITE) { + error = get_block(inode, block, &bh, 
1); + count_vm_event(PGMAJFAULT); + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); + major = VM_FAULT_MAJOR; + if (error || bh.b_size < PAGE_SIZE) + goto sigbus; + } else { + return dax_load_hole(mapping, page, vmf); + } + } + + /* Recheck i_size under i_mmap_mutex */ + mutex_lock(&mapping->i_mmap_mutex); + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + mutex_unlock(&mapping->i_mmap_mutex); + goto sigbus; + } + if (vmf->cow_page) { + if (buffer_written(&bh)) + copy_user_bh(vmf->cow_page, inode, &bh, vaddr); + else + clear_user_highpage(vmf->cow_page, vaddr); + if (page) { + unlock_page(page); + page_cache_release(page); + } + /* do_cow_fault() will release the i_mmap_mutex */ + return VM_FAULT_COWED; + } + + if (buffer_unwritten(&bh) || buffer_new(&bh)) + dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); + + error = dax_get_pfn(inode, &bh, &pfn); + if (error > 0) + error = vm_insert_mixed(vma, vaddr, pfn); + mutex_unlock(&mapping->i_mmap_mutex); + + if (page) { + delete_from_page_cache(page); + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, + PAGE_CACHE_SIZE, 0); + unlock_page(page); + page_cache_release(page); + } + + if (error == -ENOMEM) + return VM_FAULT_OOM; + /* -EBUSY is fine, somebody else faulted on the same PTE */ + if (error != -EBUSY) + BUG_ON(error); + return VM_FAULT_NOPAGE | major; + + sigbus: + if (page) { + unlock_page(page); + page_cache_release(page); + } + return VM_FAULT_SIGBUS; +} + +/** + * dax_fault - handle a page fault on an XIP file + * @vma: The virtual memory area where the fault occurred + * @vmf: The description of the fault + * @get_block: The filesystem method used to translate file offsets to blocks + * + * When a page fault occurs, filesystems may call this helper in their + * fault handler for XIP files. 
+ */ +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, + get_block_t get_block) +{ + int result; + struct super_block *sb = file_inode(vma->vm_file)->i_sb; + + sb_start_pagefault(sb); + file_update_time(vma->vm_file); + result = do_dax_fault(vma, vmf, get_block); + sb_end_pagefault(sb); + + return result; +} +EXPORT_SYMBOL_GPL(dax_fault); + +/** + * dax_mkwrite - convert a read-only page to read-write in an XIP file + * @vma: The virtual memory area where the fault occurred + * @vmf: The description of the fault + * @get_block: The filesystem method used to translate file offsets to blocks + * + * XIP handles reads of holes by adding pages full of zeroes into the + * mapping. If the page is subsequently written to, we have to allocate + * the page on media and free the page that was in the cache. + */ +int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, + get_block_t get_block) +{ + int result; + struct super_block *sb = file_inode(vma->vm_file)->i_sb; + + sb_start_pagefault(sb); + file_update_time(vma->vm_file); + result = do_dax_fault(vma, vmf, get_block); + sb_end_pagefault(sb); + + return result; +} +EXPORT_SYMBOL_GPL(dax_mkwrite); diff --git a/fs/ext2/file.c b/fs/ext2/file.c index ef5cf96..e3ce10d 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -25,6 +25,37 @@ #include "xattr.h" #include "acl.h" +#ifdef CONFIG_EXT2_FS_XIP +static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_fault(vma, vmf, ext2_get_block); +} + +static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_mkwrite(vma, vmf, ext2_get_block); +} + +static const struct vm_operations_struct ext2_dax_vm_ops = { + .fault = ext2_dax_fault, + .page_mkwrite = ext2_dax_mkwrite, + .remap_pages = generic_file_remap_pages, +}; + +static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + if (!IS_DAX(file_inode(file))) + return generic_file_mmap(file, vma); + + file_accessed(file); + 
vma->vm_ops = &ext2_dax_vm_ops; + vma->vm_flags |= VM_MIXEDMAP; + return 0; +} +#else +#define ext2_file_mmap generic_file_mmap +#endif + /* * Called when filp is released. This happens when all file descriptors * for a single struct file are closed. Note that different open() calls @@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ext2_compat_ioctl, #endif - .mmap = generic_file_mmap, + .mmap = ext2_file_mmap, .open = dquot_file_open, .release = ext2_release_file, .fsync = ext2_fsync, @@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ext2_compat_ioctl, #endif - .mmap = xip_file_mmap, + .mmap = ext2_file_mmap, .open = dquot_file_open, .release = ext2_release_file, .fsync = ext2_fsync, diff --git a/include/linux/fs.h b/include/linux/fs.h index dabc601..1607812 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -48,6 +48,7 @@ struct cred; struct swap_info_struct; struct seq_file; struct workqueue_struct; +struct vm_fault; extern void __init inode_init(void); extern void __init inode_init_early(void); @@ -2521,10 +2522,11 @@ extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_XIP -extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma); extern int xip_truncate_page(struct address_space *mapping, loff_t from); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags); +int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); +int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t); #else static inline int xip_truncate_page(struct address_space *mapping, loff_t from) { diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index f7c37a1..9dd45f3 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -22,212 
+22,6 @@ #include /* - * We do use our own empty page to avoid interference with other users - * of ZERO_PAGE(), such as /dev/zero - */ -static DEFINE_MUTEX(xip_sparse_mutex); -static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq); -static struct page *__xip_sparse_page; - -/* called under xip_sparse_mutex */ -static struct page *xip_sparse_page(void) -{ - if (!__xip_sparse_page) { - struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO); - - if (page) - __xip_sparse_page = page; - } - return __xip_sparse_page; -} - -/* - * __xip_unmap is invoked from xip_unmap and - * xip_write - * - * This function walks all vmas of the address_space and unmaps the - * __xip_sparse_page when found at pgoff. - */ -static void -__xip_unmap (struct address_space * mapping, - unsigned long pgoff) -{ - struct vm_area_struct *vma; - struct mm_struct *mm; - unsigned long address; - pte_t *pte; - pte_t pteval; - spinlock_t *ptl; - struct page *page; - unsigned count; - int locked = 0; - - count = read_seqcount_begin(&xip_sparse_seq); - - page = __xip_sparse_page; - if (!page) - return; - -retry: - mutex_lock(&mapping->i_mmap_mutex); - vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { - mm = vma->vm_mm; - address = vma->vm_start + - ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); - BUG_ON(address < vma->vm_start || address >= vma->vm_end); - pte = page_check_address(page, mm, address, &ptl, 1); - if (pte) { - /* Nuke the page table entry. 
*/ - flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); - page_remove_rmap(page); - dec_mm_counter(mm, MM_FILEPAGES); - BUG_ON(pte_dirty(pteval)); - pte_unmap_unlock(pte, ptl); - /* must invalidate_page _before_ freeing the page */ - mmu_notifier_invalidate_page(mm, address); - page_cache_release(page); - } - } - mutex_unlock(&mapping->i_mmap_mutex); - - if (locked) { - mutex_unlock(&xip_sparse_mutex); - } else if (read_seqcount_retry(&xip_sparse_seq, count)) { - mutex_lock(&xip_sparse_mutex); - locked = 1; - goto retry; - } -} - -/* - * xip_fault() is invoked via the vma operations vector for a - * mapped memory region to read in file data during a page fault. - * - * This function is derived from filemap_fault, but used for execute in place - */ -static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf) -{ - struct file *file = vma->vm_file; - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; - pgoff_t size; - void *xip_mem; - unsigned long xip_pfn; - struct page *page; - int error; - - /* XXX: are VM_FAULT_ codes OK? 
*/ -again: - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; - if (vmf->pgoff >= size) - return VM_FAULT_SIGBUS; - - error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0, - &xip_mem, &xip_pfn); - if (likely(!error)) - goto found; - if (error != -ENODATA) - return VM_FAULT_OOM; - - /* sparse block */ - if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) && - (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) && - (!(mapping->host->i_sb->s_flags & MS_RDONLY))) { - int err; - - /* maybe shared writable, allocate new block */ - mutex_lock(&xip_sparse_mutex); - error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1, - &xip_mem, &xip_pfn); - mutex_unlock(&xip_sparse_mutex); - if (error) - return VM_FAULT_SIGBUS; - /* unmap sparse mappings at pgoff from all other vmas */ - __xip_unmap(mapping, vmf->pgoff); - -found: - /* We must recheck i_size under i_mmap_mutex */ - mutex_lock(&mapping->i_mmap_mutex); - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> - PAGE_CACHE_SHIFT; - if (unlikely(vmf->pgoff >= size)) { - mutex_unlock(&mapping->i_mmap_mutex); - return VM_FAULT_SIGBUS; - } - err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, - xip_pfn); - mutex_unlock(&mapping->i_mmap_mutex); - if (err == -ENOMEM) - return VM_FAULT_OOM; - /* - * err == -EBUSY is fine, we've raced against another thread - * that faulted-in the same page - */ - if (err != -EBUSY) - BUG_ON(err); - return VM_FAULT_NOPAGE; - } else { - int err, ret = VM_FAULT_OOM; - - mutex_lock(&xip_sparse_mutex); - write_seqcount_begin(&xip_sparse_seq); - error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0, - &xip_mem, &xip_pfn); - if (unlikely(!error)) { - write_seqcount_end(&xip_sparse_seq); - mutex_unlock(&xip_sparse_mutex); - goto again; - } - if (error != -ENODATA) - goto out; - - /* We must recheck i_size under i_mmap_mutex */ - mutex_lock(&mapping->i_mmap_mutex); - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> - PAGE_CACHE_SHIFT; - if (unlikely(vmf->pgoff >= 
size)) { - ret = VM_FAULT_SIGBUS; - goto unlock; - } - /* not shared and writable, use xip_sparse_page() */ - page = xip_sparse_page(); - if (!page) - goto unlock; - err = vm_insert_page(vma, (unsigned long)vmf->virtual_address, - page); - if (err == -ENOMEM) - goto unlock; - - ret = VM_FAULT_NOPAGE; -unlock: - mutex_unlock(&mapping->i_mmap_mutex); -out: - write_seqcount_end(&xip_sparse_seq); - mutex_unlock(&xip_sparse_mutex); - - return ret; - } -} - -static const struct vm_operations_struct xip_file_vm_ops = { - .fault = xip_file_fault, - .page_mkwrite = filemap_page_mkwrite, - .remap_pages = generic_file_remap_pages, -}; - -int xip_file_mmap(struct file * file, struct vm_area_struct * vma) -{ - BUG_ON(!file->f_mapping->a_ops->get_xip_mem); - - file_accessed(file); - vma->vm_ops = &xip_file_vm_ops; - vma->vm_flags |= VM_MIXEDMAP; - return 0; -} -EXPORT_SYMBOL_GPL(xip_file_mmap); - -/* * truncate a page used for execute in place * functionality is analog to block_truncate_page but does use get_xip_mem * to get the page instead of page cache -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752814AbaCWTQe (ORCPT ); Sun, 23 Mar 2014 15:16:34 -0400 Received: from mga09.intel.com ([134.134.136.24]:29846 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751463AbaCWTJB (ORCPT ); Sun, 23 Mar 2014 15:09:01 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505944093" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Date: Sun, 23 Mar 2014 15:08:32 -0400 Message-Id: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Use the generic AIO infrastructure instead of custom read and write methods. In addition to giving us support for AIO, this adds the missing locking between read() and truncate(). Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler --- fs/Makefile | 1 + fs/dax.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext2/file.c | 6 +- fs/ext2/inode.c | 7 +- include/linux/fs.h | 18 ++++- mm/filemap.c | 6 +- mm/filemap_xip.c | 234 ----------------------------------------------------- 7 files changed, 243 insertions(+), 245 deletions(-) create mode 100644 fs/dax.c diff --git a/fs/Makefile b/fs/Makefile index 47ac07b..2f194cd 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_FS_XIP) += dax.o obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o diff --git a/fs/dax.c b/fs/dax.c new file mode 100644 index 0000000..66a6bda --- /dev/null +++ b/fs/dax.c @@ -0,0 +1,216 @@ +/* + * fs/dax.c - Direct Access filesystem code + * Copyright (c) 2013-2014 Intel Corporation + * Author: Matthew Wilcox + * Author: Ross Zwisler + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +static long dax_get_addr(struct inode *inode, struct buffer_head *bh, + void **addr) +{ + struct block_device *bdev = bh->b_bdev; + const struct block_device_operations *ops = bdev->bd_disk->fops; + unsigned long pfn; + sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); + return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size); +} + +static void dax_new_buf(void *addr, unsigned size, unsigned first, + loff_t offset, loff_t end, int rw) +{ + loff_t final = end - offset + first; /* The final byte of the buffer */ + if (rw != WRITE) { + memset(addr, 0, size); + return; + } + + if (first > 0) + memset(addr, 0, first); + if (final < size) + memset(addr + final, 0, size - final); +} + +static bool buffer_written(struct buffer_head *bh) +{ + return buffer_mapped(bh) && !buffer_unwritten(bh); +} + +/* + * When ext4 encounters a hole, it likes to return without modifying the + * buffer_head which means that we can't trust b_size. To cope with this, + * we set b_state to 0 before calling get_block and, if any bit is set, we + * know we can trust b_size. Unfortunate, really, since ext4 does know + * precisely how long a hole is and would save us time calling get_block + * repeatedly. 
+ */ +static bool buffer_size_valid(struct buffer_head *bh) +{ + return bh->b_state != 0; +} + +static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov, + loff_t start, loff_t end, get_block_t get_block, + struct buffer_head *bh) +{ + ssize_t retval = 0; + unsigned seg = 0; + unsigned len; + unsigned copied = 0; + loff_t offset = start; + loff_t max = start; + loff_t bh_max = start; + void *addr; + bool hole = false; + + if (rw != WRITE) + end = min(end, i_size_read(inode)); + + while (offset < end) { + void __user *buf = iov[seg].iov_base + copied; + + if (offset == max) { + sector_t block = offset >> inode->i_blkbits; + unsigned first = offset - (block << inode->i_blkbits); + long size; + + if (offset == bh_max) { + bh->b_size = PAGE_ALIGN(end - offset); + bh->b_state = 0; + retval = get_block(inode, block, bh, + rw == WRITE); + if (retval) + break; + if (!buffer_size_valid(bh)) + bh->b_size = 1 << inode->i_blkbits; + bh_max = offset - first + bh->b_size; + } else { + unsigned done = bh->b_size - (bh_max - + (offset - first)); + bh->b_blocknr += done >> inode->i_blkbits; + bh->b_size -= done; + } + if (rw == WRITE) { + if (!buffer_mapped(bh)) { + retval = -EIO; + break; + } + hole = false; + } else { + hole = !buffer_written(bh); + } + + if (hole) { + addr = NULL; + size = bh->b_size - first; + } else { + retval = dax_get_addr(inode, bh, &addr); + if (retval < 0) + break; + if (buffer_unwritten(bh) || buffer_new(bh)) + dax_new_buf(addr, retval, first, + offset, end, rw); + addr += first; + size = retval - first; + } + max = min(offset + size, end); + } + + len = min_t(unsigned, iov[seg].iov_len - copied, max - offset); + + if (rw == WRITE) + len -= __copy_from_user_nocache(addr, buf, len); + else if (!hole) + len -= __copy_to_user(buf, addr, len); + else + len -= __clear_user(buf, len); + + if (!len) + break; + + offset += len; + copied += len; + addr += len; + if (copied == iov[seg].iov_len) { + seg++; + copied = 0; + } + } + + return (offset 
== start) ? retval : offset - start; +} + +/** + * dax_do_io - Perform I/O to a DAX file + * @rw: READ to read or WRITE to write + * @iocb: The control block for this I/O + * @inode: The file which the I/O is directed at + * @iov: The user addresses to do I/O from or to + * @offset: The file offset where the I/O starts + * @nr_segs: The length of the iov array + * @get_block: The filesystem method used to translate file offsets to blocks + * @end_io: A filesystem callback for I/O completion + * @flags: See below + * + * This function uses the same locking scheme as do_blockdev_direct_IO: + * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the + * caller for writes. For reads, we take and release the i_mutex ourselves. + * If DIO_LOCKING is not set, the filesystem takes care of its own locking. + * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O + * is in progress. + */ +ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, + const struct iovec *iov, loff_t offset, unsigned nr_segs, + get_block_t get_block, dio_iodone_t end_io, int flags) +{ + struct buffer_head bh; + unsigned seg; + ssize_t retval = -EINVAL; + loff_t end = offset; + + memset(&bh, 0, sizeof(bh)); + for (seg = 0; seg < nr_segs; seg++) + end += iov[seg].iov_len; + + if ((flags & DIO_LOCKING) && (rw == READ)) { + struct address_space *mapping = inode->i_mapping; + mutex_lock(&inode->i_mutex); + retval = filemap_write_and_wait_range(mapping, offset, end - 1); + if (retval) { + mutex_unlock(&inode->i_mutex); + goto out; + } + } + + /* Protects against truncate */ + atomic_inc(&inode->i_dio_count); + + retval = dax_io(rw, inode, iov, offset, end, get_block, &bh); + + if ((flags & DIO_LOCKING) && (rw == READ)) + mutex_unlock(&inode->i_mutex); + + inode_dio_done(inode); + + if ((retval > 0) && end_io) + end_io(iocb, offset, retval, bh.b_private); + out: + return retval; +} +EXPORT_SYMBOL_GPL(dax_do_io); diff --git a/fs/ext2/file.c b/fs/ext2/file.c 
index 44c36e5..ef5cf96 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = { #ifdef CONFIG_EXT2_FS_XIP const struct file_operations ext2_xip_file_operations = { .llseek = generic_file_llseek, - .read = xip_file_read, - .write = xip_file_write, + .read = do_sync_read, + .write = do_sync_write, + .aio_read = generic_file_aio_read, + .aio_write = generic_file_aio_write, .unlocked_ioctl = ext2_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext2_compat_ioctl, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index e7d3192..f128ebf 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, struct inode *inode = mapping->host; ssize_t ret; - ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, + ext2_get_block, NULL, DIO_LOCKING); + else + ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, ext2_get_block); if (ret < 0 && (rw & WRITE)) ext2_write_failed(mapping, offset + iov_length(iov, nr_segs)); @@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = { const struct address_space_operations ext2_aops_xip = { .bmap = ext2_bmap, .get_xip_mem = ext2_get_xip_mem, + .direct_IO = ext2_direct_IO, }; const struct address_space_operations ext2_nobh_aops = { diff --git a/include/linux/fs.h b/include/linux/fs.h index 47fd219..dabc601 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2521,17 +2521,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_XIP -extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len, - loff_t *ppos); extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma); -extern ssize_t xip_file_write(struct file *filp, const char __user *buf, - size_t len, 
loff_t *ppos); extern int xip_truncate_page(struct address_space *mapping, loff_t from); +ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, + loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags); #else static inline int xip_truncate_page(struct address_space *mapping, loff_t from) { return 0; } + +static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, + const struct iovec *iov, loff_t offset, unsigned nr_segs, + get_block_t get_block, dio_iodone_t end_io, int flags) +{ + return -ENOTTY; +} #endif #ifdef CONFIG_BLOCK @@ -2681,6 +2686,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root); extern void save_mount_options(struct super_block *sb, char *options); extern void replace_mount_options(struct super_block *sb, char *options); +static inline bool io_is_direct(struct file *filp) +{ + return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp)); +} + static inline ino_t parent_ino(struct dentry *dentry) { ino_t res; diff --git a/mm/filemap.c b/mm/filemap.c index 7a13f6a..1b7dff6 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, if (retval) return retval; - /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ - if (filp->f_flags & O_DIRECT) { + if (io_is_direct(filp)) { loff_t size; struct address_space *mapping; struct inode *inode; @@ -2468,8 +2467,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, if (err) goto out; - /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ - if (unlikely(file->f_flags & O_DIRECT)) { + if (io_is_direct(file)) { loff_t endbyte; ssize_t written_buffered; diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index c8d23e9..f7c37a1 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void) } /* - * This is a file read routine for execute in place files, and uses - * the 
mapping->a_ops->get_xip_mem() function for the actual low-level - * stuff. - * - * Note the struct file* is not used at all. It may be NULL. - */ -static ssize_t -do_xip_mapping_read(struct address_space *mapping, - struct file_ra_state *_ra, - struct file *filp, - char __user *buf, - size_t len, - loff_t *ppos) -{ - struct inode *inode = mapping->host; - pgoff_t index, end_index; - unsigned long offset; - loff_t isize, pos; - size_t copied = 0, error = 0; - - BUG_ON(!mapping->a_ops->get_xip_mem); - - pos = *ppos; - index = pos >> PAGE_CACHE_SHIFT; - offset = pos & ~PAGE_CACHE_MASK; - - isize = i_size_read(inode); - if (!isize) - goto out; - - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; - do { - unsigned long nr, left; - void *xip_mem; - unsigned long xip_pfn; - int zero = 0; - - /* nr is the maximum number of bytes to copy from this page */ - nr = PAGE_CACHE_SIZE; - if (index >= end_index) { - if (index > end_index) - goto out; - nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; - if (nr <= offset) { - goto out; - } - } - nr = nr - offset; - if (nr > len - copied) - nr = len - copied; - - error = mapping->a_ops->get_xip_mem(mapping, index, 0, - &xip_mem, &xip_pfn); - if (unlikely(error)) { - if (error == -ENODATA) { - /* sparse */ - zero = 1; - } else - goto out; - } - - /* If users can be writing to this page using arbitrary - * virtual addresses, take care about potential aliasing - * before reading the page on the kernel side. - */ - if (mapping_writably_mapped(mapping)) - /* address based flush */ ; - - /* - * Ok, we have the mem, so now we can copy it to user space... - * - * The actor routine returns how many bytes were actually used.. - * NOTE! This may not be the same as how much of a user buffer - * we filled up (we may be padding etc), so we can only update - * "pos" here (the actor routine has to update the user buffer - * pointers and the remaining count). 
- */ - if (!zero) - left = __copy_to_user(buf+copied, xip_mem+offset, nr); - else - left = __clear_user(buf + copied, nr); - - if (left) { - error = -EFAULT; - goto out; - } - - copied += (nr - left); - offset += (nr - left); - index += offset >> PAGE_CACHE_SHIFT; - offset &= ~PAGE_CACHE_MASK; - } while (copied < len); - -out: - *ppos = pos + copied; - if (filp) - file_accessed(filp); - - return (copied ? copied : error); -} - -ssize_t -xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos) -{ - if (!access_ok(VERIFY_WRITE, buf, len)) - return -EFAULT; - - return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp, - buf, len, ppos); -} -EXPORT_SYMBOL_GPL(xip_file_read); - -/* * __xip_unmap is invoked from xip_unmap and * xip_write * @@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma) } EXPORT_SYMBOL_GPL(xip_file_mmap); -static ssize_t -__xip_file_write(struct file *filp, const char __user *buf, - size_t count, loff_t pos, loff_t *ppos) -{ - struct address_space * mapping = filp->f_mapping; - const struct address_space_operations *a_ops = mapping->a_ops; - struct inode *inode = mapping->host; - long status = 0; - size_t bytes; - ssize_t written = 0; - - BUG_ON(!mapping->a_ops->get_xip_mem); - - do { - unsigned long index; - unsigned long offset; - size_t copied; - void *xip_mem; - unsigned long xip_pfn; - - offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */ - index = pos >> PAGE_CACHE_SHIFT; - bytes = PAGE_CACHE_SIZE - offset; - if (bytes > count) - bytes = count; - - status = a_ops->get_xip_mem(mapping, index, 0, - &xip_mem, &xip_pfn); - if (status == -ENODATA) { - /* we allocate a new page unmap it */ - mutex_lock(&xip_sparse_mutex); - status = a_ops->get_xip_mem(mapping, index, 1, - &xip_mem, &xip_pfn); - mutex_unlock(&xip_sparse_mutex); - if (!status) - /* unmap page at pgoff from all other vmas */ - __xip_unmap(mapping, index); - } - - if (status) - break; - - copied = bytes - - 
__copy_from_user_nocache(xip_mem + offset, buf, bytes); - - if (likely(copied > 0)) { - status = copied; - - if (status >= 0) { - written += status; - count -= status; - pos += status; - buf += status; - } - } - if (unlikely(copied != bytes)) - if (status >= 0) - status = -EFAULT; - if (status < 0) - break; - } while (count); - *ppos = pos; - /* - * No need to use i_size_read() here, the i_size - * cannot change under us because we hold i_mutex. - */ - if (pos > inode->i_size) { - i_size_write(inode, pos); - mark_inode_dirty(inode); - } - - return written ? written : status; -} - -ssize_t -xip_file_write(struct file *filp, const char __user *buf, size_t len, - loff_t *ppos) -{ - struct address_space *mapping = filp->f_mapping; - struct inode *inode = mapping->host; - size_t count; - loff_t pos; - ssize_t ret; - - mutex_lock(&inode->i_mutex); - - if (!access_ok(VERIFY_READ, buf, len)) { - ret=-EFAULT; - goto out_up; - } - - pos = *ppos; - count = len; - - /* We can write back this queue in page reclaim */ - current->backing_dev_info = mapping->backing_dev_info; - - ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode)); - if (ret) - goto out_backing; - if (count == 0) - goto out_backing; - - ret = file_remove_suid(filp); - if (ret) - goto out_backing; - - ret = file_update_time(filp); - if (ret) - goto out_backing; - - ret = __xip_file_write (filp, buf, count, pos, ppos); - - out_backing: - current->backing_dev_info = NULL; - out_up: - mutex_unlock(&inode->i_mutex); - return ret; -} -EXPORT_SYMBOL_GPL(xip_file_write); - /* * truncate a page used for execute in place * functionality is analog to block_truncate_page but does use get_xip_mem -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752790AbaCWTQc (ORCPT ); Sun, 23 Mar 2014 15:16:32 -0400 Received: from mga01.intel.com ([192.55.52.88]:64546 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org 
with ESMTP id S1751465AbaCWTJB (ORCPT ); Sun, 23 Mar 2014 15:09:01 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505021426" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Date: Sun, 23 Mar 2014 15:08:37 -0400 Message-Id: X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This is practically generic code; other filesystems will want to call it from other places, but there's nothing ext2-specific about it. Make it a little more generic by allowing it to take a count of the number of bytes to zero rather than fixing it to a single page. Thanks to Dave Hansen for suggesting that I need to call cond_resched() if zeroing more than one page. Signed-off-by: Matthew Wilcox --- fs/dax.c | 34 ++++++++++++++++++++++++++++++++++ fs/ext2/inode.c | 8 +++++--- fs/ext2/xip.c | 23 ----------------------- fs/ext2/xip.h | 3 --- include/linux/fs.h | 6 ++++++ 5 files changed, 45 insertions(+), 29 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 7271be0..45a0a41 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -23,9 +23,43 @@ #include #include #include +#include #include #include +int dax_clear_blocks(struct inode *inode, sector_t block, long size) +{ + struct block_device *bdev = inode->i_sb->s_bdev; + const struct block_device_operations *ops = bdev->bd_disk->fops; + sector_t sector = block << (inode->i_blkbits - 9); + unsigned long pfn; + + might_sleep(); + do { + void *addr; + long count = ops->direct_access(bdev, sector, &addr, &pfn, + size); + if (count < 0) + return count; + while (count >= PAGE_SIZE) { + clear_page(addr); + addr += PAGE_SIZE; + size -= PAGE_SIZE; + count -= PAGE_SIZE; + sector += PAGE_SIZE / 512; + cond_resched(); + } + if (count 
> 0) { + memset(addr, 0, count); + sector += count / 512; + size -= count; + } + } while (size); + + return 0; +} +EXPORT_SYMBOL_GPL(dax_clear_blocks); + static long dax_get_addr(struct inode *inode, struct buffer_head *bh, void **addr) { diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index b156fe8..a9346a9 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode, if (IS_DAX(inode)) { /* - * we need to clear the block + * block must be initialised before we put it in the tree + * so that it's not found by another thread before it's + * initialised */ - err = ext2_clear_xip_target (inode, - le32_to_cpu(chain[depth-1].key)); + err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key), + count << inode->i_blkbits); if (err) { mutex_unlock(&ei->truncate_mutex); goto cleanup; diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c index ca745ff..132d4da 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -13,29 +13,6 @@ #include "ext2.h" #include "xip.h" -static inline long __inode_direct_access(struct inode *inode, sector_t block, - void **kaddr, unsigned long *pfn, long size) -{ - struct block_device *bdev = inode->i_sb->s_bdev; - const struct block_device_operations *ops = bdev->bd_disk->fops; - sector_t sector = block * (PAGE_SIZE / 512); - return ops->direct_access(bdev, sector, kaddr, pfn, size); -} - -int -ext2_clear_xip_target(struct inode *inode, sector_t block) -{ - void *kaddr; - unsigned long pfn; - long size; - - size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE); - if (size < 0) - return size; - clear_page(kaddr); - return 0; -} - void ext2_xip_verify_sb(struct super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index 0fa8b7f..e7b9f0a 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -7,8 +7,6 @@ #ifdef CONFIG_EXT2_FS_XIP extern void ext2_xip_verify_sb (struct super_block *); -extern int ext2_clear_xip_target (struct inode *, sector_t); - 
static inline int ext2_use_xip (struct super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); @@ -17,5 +15,4 @@ static inline int ext2_use_xip (struct super_block *sb) #else #define ext2_xip_verify_sb(sb) do { } while (0) #define ext2_use_xip(sb) 0 -#define ext2_clear_xip_target(inode, chain) 0 #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index c777056..aeab3fda 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2520,12 +2520,18 @@ extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_XIP +int dax_clear_blocks(struct inode *, sector_t block, long size); int dax_truncate_page(struct inode *, loff_t from, get_block_t); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t); #else +static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz) +{ + return 0; +} + static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb) { return 0; -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752756AbaCWTQb (ORCPT ); Sun, 23 Mar 2014 15:16:31 -0400 Received: from mga09.intel.com ([134.134.136.24]:29849 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751475AbaCWTJB (ORCPT ); Sun, 23 Mar 2014 15:09:01 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505944100" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 09/22] Remove mm/filemap_xip.c Date: Sun, 23 Mar 2014 15:08:35 -0400 Message-Id: 
<69ab315f0124881ae74d9881c48c7bdc70368fd1.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org It is now empty as all of its contents have been replaced by fs/dax.c Signed-off-by: Matthew Wilcox --- mm/Makefile | 1 - mm/filemap_xip.c | 23 ----------------------- 2 files changed, 24 deletions(-) delete mode 100644 mm/filemap_xip.c diff --git a/mm/Makefile b/mm/Makefile index 310c90a..454c176 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o obj-$(CONFIG_KMEMCHECK) += kmemcheck.o obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o -obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c deleted file mode 100644 index 6316578..0000000 --- a/mm/filemap_xip.c +++ /dev/null @@ -1,23 +0,0 @@ -/* - * linux/mm/filemap_xip.c - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte - * - * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds - * - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752715AbaCWTQa (ORCPT ); Sun, 23 Mar 2014 15:16:30 -0400 Received: from mga09.intel.com ([134.134.136.24]:29851 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751494AbaCWTJB (ORCPT ); Sun, 23 Mar 2014 15:09:01 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="497904582" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew
Wilcox , willy@linux.intel.com Subject: [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb() Date: Sun, 23 Mar 2014 15:08:38 -0400 Message-Id: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount() doesn't make sense, since changing the XIP option on remount isn't allowed. It also doesn't make sense to re-check whether blocksize is supported since it can't change between mounts. Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the equivalent check and delete the definition. Signed-off-by: Matthew Wilcox --- fs/ext2/super.c | 33 ++++++++++++--------------------- fs/ext2/xip.c | 12 ------------ fs/ext2/xip.h | 2 -- 3 files changed, 12 insertions(+), 35 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 20d6697..3a1db39 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) ((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? 
MS_POSIXACL : 0); - ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset - EXT2_MOUNT_XIP if not */ - if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV && (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) || EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) || @@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); - if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) { - if (!silent) + if (sbi->s_mount_opt & EXT2_MOUNT_XIP) { + if (blocksize != PAGE_SIZE) { ext2_msg(sb, KERN_ERR, - "error: unsupported blocksize for xip"); - goto failed_mount; + "error: unsupported blocksize for xip"); + goto failed_mount; + } + if (!sb->s_bdev->bd_disk->fops->direct_access) { + ext2_msg(sb, KERN_ERR, + "error: device does not support xip"); + goto failed_mount; + } } /* If the blocksize doesn't match, re-read the thing.. */ @@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) { struct ext2_sb_info * sbi = EXT2_SB(sb); struct ext2_super_block * es; - unsigned long old_mount_opt = sbi->s_mount_opt; struct ext2_mount_options old_opts; unsigned long old_sb_flags; int err; @@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | ((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? 
MS_POSIXACL : 0); - ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset - EXT2_MOUNT_XIP if not */ - - if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) { - ext2_msg(sb, KERN_WARNING, - "warning: unsupported blocksize for xip"); - err = -EINVAL; - goto restore_opts; - } - es = sbi->s_es; - if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) { + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { ext2_msg(sb, KERN_WARNING, "warning: refusing change of " "xip flag with busy inodes while remounting"); - sbi->s_mount_opt &= ~EXT2_MOUNT_XIP; - sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP; + sbi->s_mount_opt ^= EXT2_MOUNT_XIP; } if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) { spin_unlock(&sbi->s_lock); diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c index 132d4da..66ca113 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -13,15 +13,3 @@ #include "ext2.h" #include "xip.h" -void ext2_xip_verify_sb(struct super_block *sb) -{ - struct ext2_sb_info *sbi = EXT2_SB(sb); - - if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) && - !sb->s_bdev->bd_disk->fops->direct_access) { - sbi->s_mount_opt &= (~EXT2_MOUNT_XIP); - ext2_msg(sb, KERN_WARNING, - "warning: ignoring xip option - " - "not supported by bdev"); - } -} diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index e7b9f0a..87eeb04 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -6,13 +6,11 @@ */ #ifdef CONFIG_EXT2_FS_XIP -extern void ext2_xip_verify_sb (struct super_block *); static inline int ext2_use_xip (struct super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); return (sbi->s_mount_opt & EXT2_MOUNT_XIP); } #else -#define ext2_xip_verify_sb(sb) do { } while (0) #define ext2_use_xip(sb) 0 #endif -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752205AbaCWTTn (ORCPT ); Sun, 23 Mar 2014 15:19:43 -0400 Received: from mga09.intel.com ([134.134.136.24]:29849 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by 
vger.kernel.org with ESMTP id S1751443AbaCWTJA (ORCPT ); Sun, 23 Mar 2014 15:09:00 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="497904564" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 04/22] Change direct_access calling convention Date: Sun, 23 Mar 2014 15:08:30 -0400 Message-Id: <214af2a38d840d0b8e983d39d03711d1292bc2d6.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org In order to support accesses to larger chunks of memory, pass in a 'size' parameter (counted in bytes), and return the amount available at that address. Signed-off-by: Matthew Wilcox --- Documentation/filesystems/xip.txt | 15 +++++++++------ arch/powerpc/sysdev/axonram.c | 6 +++--- drivers/block/brd.c | 8 +++++--- drivers/s390/block/dcssblk.c | 19 ++++++++++--------- fs/ext2/xip.c | 30 +++++++++++++----------------- include/linux/blkdev.h | 4 ++-- 6 files changed, 42 insertions(+), 40 deletions(-) diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt index 0466ee5..b62eabf 100644 --- a/Documentation/filesystems/xip.txt +++ b/Documentation/filesystems/xip.txt @@ -28,12 +28,15 @@ Implementation Execute-in-place is implemented in three steps: block device operation, address space operation, and file operations. -A block device operation named direct_access is used to retrieve a -reference (pointer) to a block on-disk. The reference is supposed to be -cpu-addressable, physical address and remain valid until the release operation -is performed. A struct block_device reference is used to address the device, -and a sector_t argument is used to identify the individual block. As an -alternative, memory technology devices can be used for this. 
+A block device operation named direct_access is used to translate the +block device sector number to a page frame number (pfn) that identifies +the physical page for the memory. It also returns a kernel virtual +address that can be used to access the memory. + +The direct_access method takes a 'size' parameter that indicates the +number of bytes being requested. The function should return the number +of bytes that it can provide, although it must not exceed the number of +bytes requested. It may also return a negative errno if an error occurs. The block device operation is optional, these block devices support it as of today: diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c index 830edc8..1697e29 100644 --- a/arch/powerpc/sysdev/axonram.c +++ b/arch/powerpc/sysdev/axonram.c @@ -139,9 +139,9 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio) * axon_ram_direct_access - direct_access() method for block device * @device, @sector, @data: see block_device_operations method */ -static int +static long axon_ram_direct_access(struct block_device *device, sector_t sector, - void **kaddr, unsigned long *pfn) + void **kaddr, unsigned long *pfn, long size) { struct axon_ram_bank *bank = device->bd_disk->private_data; loff_t offset; @@ -158,7 +158,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector, *kaddr = (void *)(bank->ph_addr + offset); *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; - return 0; + return min_t(unsigned long, size, bank->size - offset); } static const struct block_device_operations axon_ram_devops = { diff --git a/drivers/block/brd.c b/drivers/block/brd.c index e73b85c..00da60d 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -361,8 +361,8 @@ out: } #ifdef CONFIG_BLK_DEV_XIP -static int brd_direct_access(struct block_device *bdev, sector_t sector, - void **kaddr, unsigned long *pfn) +static long brd_direct_access(struct block_device *bdev, sector_t sector, + void **kaddr, unsigned 
long *pfn, long size) { struct brd_device *brd = bdev->bd_disk->private_data; struct page *page; @@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector, *kaddr = page_address(page); *pfn = page_to_pfn(page); - return 0; + /* Could optimistically check to see if the next page in the + * file is mapped to the next page of physical RAM */ + return PAGE_SIZE; } #endif diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index ebf41e2..da914b2 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -28,8 +28,8 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode); static void dcssblk_release(struct gendisk *disk, fmode_t mode); static void dcssblk_make_request(struct request_queue *q, struct bio *bio); -static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum, - void **kaddr, unsigned long *pfn); +static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum, + void **kaddr, unsigned long *pfn, long size); static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0"; @@ -866,25 +866,26 @@ fail: bio_io_error(bio); } -static int +static long dcssblk_direct_access (struct block_device *bdev, sector_t secnum, - void **kaddr, unsigned long *pfn) + void **kaddr, unsigned long *pfn, long size) { struct dcssblk_dev_info *dev_info; - unsigned long pgoff; + unsigned long offset, dev_sz; dev_info = bdev->bd_disk->private_data; if (!dev_info) return -ENODEV; + dev_sz = dev_info->end - dev_info->start; if (secnum % (PAGE_SIZE/512)) return -EINVAL; - pgoff = secnum / (PAGE_SIZE / 512); - if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start) + offset = secnum * 512; + if (offset > dev_sz) return -ERANGE; - *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE); + *kaddr = (void *) (dev_info->start + offset); *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; - return 0; + return min_t(unsigned long, size, dev_sz - offset); } static void diff --git a/fs/ext2/xip.c 
b/fs/ext2/xip.c index e98171a..fa40091 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -13,18 +13,13 @@ #include "ext2.h" #include "xip.h" -static inline int -__inode_direct_access(struct inode *inode, sector_t block, - void **kaddr, unsigned long *pfn) +static inline long __inode_direct_access(struct inode *inode, sector_t block, + void **kaddr, unsigned long *pfn, long size) { struct block_device *bdev = inode->i_sb->s_bdev; const struct block_device_operations *ops = bdev->bd_disk->fops; - sector_t sector; - - sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */ - - BUG_ON(!ops->direct_access); - return ops->direct_access(bdev, sector, kaddr, pfn); + sector_t sector = block * (PAGE_SIZE / 512); + return ops->direct_access(bdev, sector, kaddr, pfn, size); } static inline int @@ -53,12 +48,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block) { void *kaddr; unsigned long pfn; - int rc; + long size; - rc = __inode_direct_access(inode, block, &kaddr, &pfn); - if (!rc) - clear_page(kaddr); - return rc; + size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE); + if (size < 0) + return size; + clear_page(kaddr); + return 0; } void ext2_xip_verify_sb(struct super_block *sb) @@ -77,7 +73,7 @@ void ext2_xip_verify_sb(struct super_block *sb) int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, void **kmem, unsigned long *pfn) { - int rc; + long rc; sector_t block; /* first, retrieve the sector number */ @@ -86,6 +82,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, return rc; /* retrieve address of the target data */ - rc = __inode_direct_access(mapping->host, block, kmem, pfn); - return rc; + rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE); + return (rc < 0) ? 
rc : 0; } diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 4afa4f8..c6f6210 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1560,8 +1560,8 @@ struct block_device_operations { void (*release) (struct gendisk *, fmode_t); int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); - int (*direct_access) (struct block_device *, sector_t, - void **, unsigned long *); + long (*direct_access) (struct block_device *, sector_t, + void **, unsigned long *pfn, long size); unsigned int (*check_events) (struct gendisk *disk, unsigned int clearing); /* ->media_changed() is DEPRECATED, use ->check_events() instead */ -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752690AbaCWTUG (ORCPT ); Sun, 23 Mar 2014 15:20:06 -0400 Received: from mga01.intel.com ([192.55.52.88]:9372 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751417AbaCWTJA (ORCPT ); Sun, 23 Mar 2014 15:09:00 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505021418" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 05/22] Introduce IS_DAX(inode) Date: Sun, 23 Mar 2014 15:08:31 -0400 Message-Id: <6a8918c9a0fb37882179e3699b3e04d96540b24f.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Use an inode flag to tag inodes which should avoid using the page cache. Convert ext2 to use it instead of mapping_is_xip(). 
Signed-off-by: Matthew Wilcox --- fs/ext2/inode.c | 9 ++++++--- fs/ext2/xip.h | 2 -- include/linux/fs.h | 6 ++++++ 3 files changed, 12 insertions(+), 5 deletions(-) diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 94ed3684..e7d3192 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode, goto cleanup; } - if (ext2_use_xip(inode->i_sb)) { + if (IS_DAX(inode)) { /* * we need to clear the block */ @@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) inode_dio_wait(inode); - if (mapping_is_xip(inode->i_mapping)) + if (IS_DAX(inode)) error = xip_truncate_page(inode->i_mapping, newsize); else if (test_opt(inode->i_sb, NOBH)) error = nobh_truncate_page(inode->i_mapping, @@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode) { unsigned int flags = EXT2_I(inode)->i_flags; - inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); + inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | + S_DIRSYNC | S_DAX); if (flags & EXT2_SYNC_FL) inode->i_flags |= S_SYNC; if (flags & EXT2_APPEND_FL) @@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode) inode->i_flags |= S_NOATIME; if (flags & EXT2_DIRSYNC_FL) inode->i_flags |= S_DIRSYNC; + if (test_opt(inode->i_sb, XIP)) + inode->i_flags |= S_DAX; } /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */ diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index 18b34d2..29be737 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb) } int ext2_get_xip_mem(struct address_space *, pgoff_t, int, void **, unsigned long *); -#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem) #else -#define mapping_is_xip(map) 0 #define ext2_xip_verify_sb(sb) do { } while (0) #define ext2_use_xip(sb) 0 #define ext2_clear_xip_target(inode, chain) 0 diff --git a/include/linux/fs.h b/include/linux/fs.h index 23b2a35..47fd219 100644 --- 
a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1644,6 +1644,7 @@ struct super_operations { #define S_IMA 1024 /* Inode has an associated IMA struct */ #define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */ #define S_NOSEC 4096 /* no suid or xattr security attributes */ +#define S_DAX 8192 /* Direct Access, avoiding the page cache */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -1681,6 +1682,11 @@ struct super_operations { #define IS_IMA(inode) ((inode)->i_flags & S_IMA) #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) +#ifdef CONFIG_FS_XIP +#define IS_DAX(inode) ((inode)->i_flags & S_DAX) +#else +#define IS_DAX(inode) 0 +#endif /* * Inode state bits. Protected by inode->i_lock -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752399AbaCWTUE (ORCPT ); Sun, 23 Mar 2014 15:20:04 -0400 Received: from mga09.intel.com ([134.134.136.24]:29846 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751439AbaCWTJA (ORCPT ); Sun, 23 Mar 2014 15:09:00 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="497904559" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 03/22] axonram: Fix bug in direct_access Date: Sun, 23 Mar 2014 15:08:29 -0400 Message-Id: X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The 'pfn' returned by axonram was completely bogus, and has been since 2008. 
Signed-off-by: Matthew Wilcox --- arch/powerpc/sysdev/axonram.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c index 47b6b9f..830edc8 100644 --- a/arch/powerpc/sysdev/axonram.c +++ b/arch/powerpc/sysdev/axonram.c @@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector, } *kaddr = (void *)(bank->ph_addr + offset); - *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT; + *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; return 0; } -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752889AbaCWTWB (ORCPT ); Sun, 23 Mar 2014 15:22:01 -0400 Received: from mga09.intel.com ([134.134.136.24]:29846 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751416AbaCWTI7 (ORCPT ); Sun, 23 Mar 2014 15:08:59 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,715,1389772800"; d="scan'208";a="505944089" From: Matthew Wilcox To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Matthew Wilcox , willy@linux.intel.com Subject: [PATCH v7 01/22] Fix XIP fault vs truncate race Date: Sun, 23 Mar 2014 15:08:27 -0400 Message-Id: <59d73a58d4cfbe190a16ce912bb2776d9cc95447.1395591795.git.matthew.r.wilcox@intel.com> X-Mailer: git-send-email 1.9.0 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Pagecache faults recheck i_size after taking the page lock to ensure that the fault didn't race against a truncate. We don't have a page to lock in the XIP case, so use the i_mmap_mutex instead. It is locked in the truncate path in unmap_mapping_range() after updating i_size. 
So while we hold it in the fault path, we are guaranteed that either i_size has already been updated in the truncate path, or that the truncate will subsequently call zap_page_range_single() and so remove the mapping we have just inserted. There is a window of time in which i_size has been reduced and the thread has a mapping to a page which will be removed from the file, but this is harmless as the page will not be allocated to a different purpose before the thread's access to it is revoked. Signed-off-by: Matthew Wilcox --- mm/filemap_xip.c | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index d8d9fe3..c8d23e9 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -260,8 +260,17 @@ again: __xip_unmap(mapping, vmf->pgoff); found: + /* We must recheck i_size under i_mmap_mutex */ + mutex_lock(&mapping->i_mmap_mutex); + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> + PAGE_CACHE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + mutex_unlock(&mapping->i_mmap_mutex); + return VM_FAULT_SIGBUS; + } err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, xip_pfn); + mutex_unlock(&mapping->i_mmap_mutex); if (err == -ENOMEM) return VM_FAULT_OOM; /* @@ -285,16 +294,27 @@ found: } if (error != -ENODATA) goto out; + + /* We must recheck i_size under i_mmap_mutex */ + mutex_lock(&mapping->i_mmap_mutex); + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> + PAGE_CACHE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + ret = VM_FAULT_SIGBUS; + goto unlock; + } /* not shared and writable, use xip_sparse_page() */ page = xip_sparse_page(); if (!page) - goto out; + goto unlock; err = vm_insert_page(vma, (unsigned long)vmf->virtual_address, page); if (err == -ENOMEM) - goto out; + goto unlock; ret = VM_FAULT_NOPAGE; +unlock: + mutex_unlock(&mapping->i_mmap_mutex); out: write_seqcount_end(&xip_sparse_seq); mutex_unlock(&xip_sparse_mutex); -- 1.9.0 From mboxrd@z Thu Jan 1 00:00:00 1970 
Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754077AbaCaK2R (ORCPT ); Mon, 31 Mar 2014 06:28:17 -0400 Received: from cantor2.suse.de ([195.135.220.15]:45241 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753914AbaCaK2O (ORCPT ); Mon, 31 Mar 2014 06:28:14 -0400 Date: Sat, 29 Mar 2014 17:22:16 +0100 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 03/22] axonram: Fix bug in direct_access Message-ID: <20140329162216.GC1211@quack.suse.cz> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:29, Matthew Wilcox wrote: > The 'pfn' returned by axonram was completely bogus, and has been since > 2008. Maybe time to drop the driver instead? When noone noticed for 6 years, it seems pretty much dead... Or is there some possibility the driver can get reused for new HW?
Anyway the patch looks correct so feel free to add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > arch/powerpc/sysdev/axonram.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c > index 47b6b9f..830edc8 100644 > --- a/arch/powerpc/sysdev/axonram.c > +++ b/arch/powerpc/sysdev/axonram.c > @@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector, > } > > *kaddr = (void *)(bank->ph_addr + offset); > - *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT; > + *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; > > return 0; > } > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754041AbaCaK2Q (ORCPT ); Mon, 31 Mar 2014 06:28:16 -0400 Received: from cantor2.suse.de ([195.135.220.15]:45237 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753907AbaCaK2N (ORCPT ); Mon, 31 Mar 2014 06:28:13 -0400 Date: Sat, 29 Mar 2014 16:57:24 +0100 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 01/22] Fix XIP fault vs truncate race Message-ID: <20140329155724.GB1211@quack.suse.cz> References: <59d73a58d4cfbe190a16ce912bb2776d9cc95447.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <59d73a58d4cfbe190a16ce912bb2776d9cc95447.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 
15:08:27, Matthew Wilcox wrote: > Pagecache faults recheck i_size after taking the page lock to ensure that > the fault didn't race against a truncate. We don't have a page to lock > in the XIP case, so use the i_mmap_mutex instead. It is locked in the > truncate path in unmap_mapping_range() after updating i_size. So while > we hold it in the fault path, we are guaranteed that either i_size has > already been updated in the truncate path, or that the truncate will > subsequently call zap_page_range_single() and so remove the mapping we > have just inserted. > > There is a window of time in which i_size has been reduced and the > thread has a mapping to a page which will be removed from the file, > but this is harmless as the page will not be allocated to a different > purpose before the thread's access to it is revoked. The patch looks good. You can add: Reviewed-by: Jan Kara Honza > Signed-off-by: Matthew Wilcox > --- > mm/filemap_xip.c | 24 ++++++++++++++++++++++-- > 1 file changed, 22 insertions(+), 2 deletions(-) > > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c > index d8d9fe3..c8d23e9 100644 > --- a/mm/filemap_xip.c > +++ b/mm/filemap_xip.c > @@ -260,8 +260,17 @@ again: > __xip_unmap(mapping, vmf->pgoff); > > found: > + /* We must recheck i_size under i_mmap_mutex */ > + mutex_lock(&mapping->i_mmap_mutex); > + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> > + PAGE_CACHE_SHIFT; > + if (unlikely(vmf->pgoff >= size)) { > + mutex_unlock(&mapping->i_mmap_mutex); > + return VM_FAULT_SIGBUS; > + } > err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, > xip_pfn); > + mutex_unlock(&mapping->i_mmap_mutex); > if (err == -ENOMEM) > return VM_FAULT_OOM; > /* > @@ -285,16 +294,27 @@ found: > } > if (error != -ENODATA) > goto out; > + > + /* We must recheck i_size under i_mmap_mutex */ > + mutex_lock(&mapping->i_mmap_mutex); > + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> > + PAGE_CACHE_SHIFT; > + if (unlikely(vmf->pgoff >= size)) { > + ret = 
VM_FAULT_SIGBUS; > + goto unlock; > + } > /* not shared and writable, use xip_sparse_page() */ > page = xip_sparse_page(); > if (!page) > - goto out; > + goto unlock; > err = vm_insert_page(vma, (unsigned long)vmf->virtual_address, > page); > if (err == -ENOMEM) > - goto out; > + goto unlock; > > ret = VM_FAULT_NOPAGE; > +unlock: > + mutex_unlock(&mapping->i_mmap_mutex); > out: > write_seqcount_end(&xip_sparse_seq); > mutex_unlock(&xip_sparse_mutex); > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753999AbaCaK2P (ORCPT ); Mon, 31 Mar 2014 06:28:15 -0400 Received: from cantor2.suse.de ([195.135.220.15]:45232 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753779AbaCaK2N (ORCPT ); Mon, 31 Mar 2014 06:28:13 -0400 Date: Sat, 29 Mar 2014 17:30:28 +0100 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 04/22] Change direct_access calling convention Message-ID: <20140329163028.GD1211@quack.suse.cz> References: <214af2a38d840d0b8e983d39d03711d1292bc2d6.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <214af2a38d840d0b8e983d39d03711d1292bc2d6.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:30, Matthew Wilcox wrote: > In order to support accesses to larger chunks of memory, pass in a > 'size' parameter (counted in bytes), and return the amount available at > that address. 
> > Signed-off-by: Matthew Wilcox Two minor nits below. Other than that you can add: Reviewed-by: Jan Kara > --- > Documentation/filesystems/xip.txt | 15 +++++++++------ > arch/powerpc/sysdev/axonram.c | 6 +++--- > drivers/block/brd.c | 8 +++++--- > drivers/s390/block/dcssblk.c | 19 ++++++++++--------- > fs/ext2/xip.c | 30 +++++++++++++----------------- > include/linux/blkdev.h | 4 ++-- > 6 files changed, 42 insertions(+), 40 deletions(-) > ... > diff --git a/drivers/block/brd.c b/drivers/block/brd.c > index e73b85c..00da60d 100644 > --- a/drivers/block/brd.c > +++ b/drivers/block/brd.c > @@ -361,8 +361,8 @@ out: > } > > #ifdef CONFIG_BLK_DEV_XIP > -static int brd_direct_access(struct block_device *bdev, sector_t sector, > - void **kaddr, unsigned long *pfn) > +static long brd_direct_access(struct block_device *bdev, sector_t sector, > + void **kaddr, unsigned long *pfn, long size) > { > struct brd_device *brd = bdev->bd_disk->private_data; > struct page *page; > @@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector, > *kaddr = page_address(page); > *pfn = page_to_pfn(page); > > - return 0; > + /* Could optimistically check to see if the next page in the > + * file is mapped to the next page of physical RAM */ > + return PAGE_SIZE; This should be min_t(long, PAGE_SIZE, size), shouldn't it? 
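A minimal userspace sketch of the clamp being suggested for the brd return value. It assumes a 4096-byte page; `SKETCH_PAGE_SIZE`, `clamp_avail()` and `clamp_avail_selfcheck()` are hypothetical names standing in for `PAGE_SIZE` and `min_t(long, PAGE_SIZE, size)`, not kernel code:

```c
#include <assert.h>

/* Assumption: 4 KiB pages; the real PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096L

/*
 * Hypothetical stand-in for min_t(long, PAGE_SIZE, size): never report
 * more bytes than the caller asked for, and never more than one page.
 */
long clamp_avail(long size)
{
	return size < SKETCH_PAGE_SIZE ? size : SKETCH_PAGE_SIZE;
}

/* Small self-check covering a short, an exact and an oversized request. */
void clamp_avail_selfcheck(void)
{
	assert(clamp_avail(512) == 512);        /* short request is honoured */
	assert(clamp_avail(4096) == 4096);      /* exactly one page */
	assert(clamp_avail(1L << 20) == 4096);  /* large request capped at a page */
}
```

With an unconditional `return PAGE_SIZE;` the first case would report 4096 bytes available for a 512-byte request, which is the mismatch the comment above points at.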
> } > #endif > > diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c > index ebf41e2..da914b2 100644 > --- a/drivers/s390/block/dcssblk.c > +++ b/drivers/s390/block/dcssblk.c > @@ -28,8 +28,8 @@ > static int dcssblk_open(struct block_device *bdev, fmode_t mode); > static void dcssblk_release(struct gendisk *disk, fmode_t mode); > static void dcssblk_make_request(struct request_queue *q, struct bio *bio); > -static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum, > - void **kaddr, unsigned long *pfn); > +static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum, > + void **kaddr, unsigned long *pfn, long size); > > static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0"; > > @@ -866,25 +866,26 @@ fail: > bio_io_error(bio); > } > > -static int > +static long > dcssblk_direct_access (struct block_device *bdev, sector_t secnum, > - void **kaddr, unsigned long *pfn) > + void **kaddr, unsigned long *pfn, long size) > { > struct dcssblk_dev_info *dev_info; > - unsigned long pgoff; > + unsigned long offset, dev_sz; > > dev_info = bdev->bd_disk->private_data; > if (!dev_info) > return -ENODEV; > + dev_sz = dev_info->end - dev_info->start; > if (secnum % (PAGE_SIZE/512)) > return -EINVAL; > - pgoff = secnum / (PAGE_SIZE / 512); > - if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start) > + offset = secnum * 512; > + if (offset > dev_sz) > return -ERANGE; > - *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE); > + *kaddr = (void *) (dev_info->start + offset); > *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; > > - return 0; > + return min_t(unsigned long, size, dev_sz - offset); ^^^ Why unsigned? Everything seems to be long... 
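One way to see why the comparison type matters here, as a userspace sketch. The helpers `min_as_unsigned()` and `min_as_signed()` are hypothetical stand-ins for `min_t(unsigned long, ...)` and `min_t(long, ...)`; whether a negative `size` can ever reach the driver is not established in this thread, so the last case only shows how the two forms would differ if it did:

```c
#include <assert.h>

/* Hypothetical stand-in for min_t(unsigned long, size, remaining). */
long min_as_unsigned(long size, unsigned long remaining)
{
	unsigned long a = (unsigned long)size; /* a negative size wraps to a huge value */
	return (long)(a < remaining ? a : remaining);
}

/* Hypothetical stand-in for min_t(long, size, remaining). */
long min_as_signed(long size, unsigned long remaining)
{
	long b = (long)remaining; /* assumes remaining fits in a long */
	return size < b ? size : b;
}

/* Self-check: the two forms agree for sane sizes and differ for a negative one. */
void min_signedness_selfcheck(void)
{
	assert(min_as_unsigned(512, 8192) == 512);
	assert(min_as_signed(512, 8192) == 512);
	assert(min_as_unsigned(-1, 8192) == 8192); /* negative size is silently hidden */
	assert(min_as_signed(-1, 8192) == -1);     /* negative size stays visible */
}
```

For non-negative sizes the two comparisons give the same answer, so the choice is mostly about keeping the types consistent with the `long` return value.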
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757394AbaDHPdB (ORCPT ); Tue, 8 Apr 2014 11:33:01 -0400 Received: from cantor2.suse.de ([195.135.220.15]:36452 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756891AbaDHPc5 (ORCPT ); Tue, 8 Apr 2014 11:32:57 -0400 Date: Tue, 8 Apr 2014 17:32:55 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 05/22] Introduce IS_DAX(inode) Message-ID: <20140408153255.GC2713@quack.suse.cz> References: <6a8918c9a0fb37882179e3699b3e04d96540b24f.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <6a8918c9a0fb37882179e3699b3e04d96540b24f.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:31, Matthew Wilcox wrote: > Use an inode flag to tag inodes which should avoid using the page cache. > Convert ext2 to use it instead of mapping_is_xip(). The patch looks good. 
You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > fs/ext2/inode.c | 9 ++++++--- > fs/ext2/xip.h | 2 -- > include/linux/fs.h | 6 ++++++ > 3 files changed, 12 insertions(+), 5 deletions(-) > > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 94ed3684..e7d3192 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode, > goto cleanup; > } > > - if (ext2_use_xip(inode->i_sb)) { > + if (IS_DAX(inode)) { > /* > * we need to clear the block > */ > @@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) > > inode_dio_wait(inode); > > - if (mapping_is_xip(inode->i_mapping)) > + if (IS_DAX(inode)) > error = xip_truncate_page(inode->i_mapping, newsize); > else if (test_opt(inode->i_sb, NOBH)) > error = nobh_truncate_page(inode->i_mapping, > @@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode) > { > unsigned int flags = EXT2_I(inode)->i_flags; > > - inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); > + inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | > + S_DIRSYNC | S_DAX); > if (flags & EXT2_SYNC_FL) > inode->i_flags |= S_SYNC; > if (flags & EXT2_APPEND_FL) > @@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode) > inode->i_flags |= S_NOATIME; > if (flags & EXT2_DIRSYNC_FL) > inode->i_flags |= S_DIRSYNC; > + if (test_opt(inode->i_sb, XIP)) > + inode->i_flags |= S_DAX; > } > > /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */ > diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h > index 18b34d2..29be737 100644 > --- a/fs/ext2/xip.h > +++ b/fs/ext2/xip.h > @@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb) > } > int ext2_get_xip_mem(struct address_space *, pgoff_t, int, > void **, unsigned long *); > -#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem) > #else > -#define mapping_is_xip(map) 0 > #define ext2_xip_verify_sb(sb) do { } while (0) > 
#define ext2_use_xip(sb) 0 > #define ext2_clear_xip_target(inode, chain) 0 > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 23b2a35..47fd219 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1644,6 +1644,7 @@ struct super_operations { > #define S_IMA 1024 /* Inode has an associated IMA struct */ > #define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */ > #define S_NOSEC 4096 /* no suid or xattr security attributes */ > +#define S_DAX 8192 /* Direct Access, avoiding the page cache */ > > /* > * Note that nosuid etc flags are inode-specific: setting some file-system > @@ -1681,6 +1682,11 @@ struct super_operations { > #define IS_IMA(inode) ((inode)->i_flags & S_IMA) > #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) > #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) > +#ifdef CONFIG_FS_XIP > +#define IS_DAX(inode) ((inode)->i_flags & S_DAX) > +#else > +#define IS_DAX(inode) 0 > +#endif > > /* > * Inode state bits. Protected by inode->i_lock > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932301AbaDHQfF (ORCPT ); Tue, 8 Apr 2014 12:35:05 -0400 Received: from cantor2.suse.de ([195.135.220.15]:37762 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757209AbaDHQfB (ORCPT ); Tue, 8 Apr 2014 12:35:01 -0400 Date: Tue, 8 Apr 2014 18:34:57 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 02/22] Allow page fault handlers to perform the COW Message-ID: <20140408163457.GD2713@quack.suse.cz> References: MIME-Version: 1.0 Content-Type: text/plain; 
charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:28, Matthew Wilcox wrote: > Currently COW of an XIP file is done by first bringing in a read-only > mapping, then retrying the fault and copying the page. It is much more > efficient to tell the fault handler that a COW is being attempted (by > passing in the pre-allocated page in the vm_fault structure), and allow > the handler to perform the COW operation itself. > > Where the filemap code protects against truncation of the file until > the PTE has been installed with the page lock, the XIP code uses the > i_mmap_mutex instead. We must therefore unlock the i_mmap_mutex after > inserting the PTE. Eww, leaking of locking details about DAX into generic fault code is really ugly. It seems to me that once you pass the cow_page into the fault handler (which looks OK to me), you can just directly install it in the PTE via vm_insert_page() and you don't have to rely on do_cow_fault() for that. Thus you can return VM_FAULT_NOPAGE and be done with it? Basically cow faults will then work the same way as other faults for DAX... Or am I missing something?
Honza > Signed-off-by: Matthew Wilcox > --- > include/linux/mm.h | 2 ++ > mm/memory.c | 45 +++++++++++++++++++++++++++++++++------------ > 2 files changed, 35 insertions(+), 12 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index c1b7414..513b78a 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -205,6 +205,7 @@ struct vm_fault { > pgoff_t pgoff; /* Logical page offset based on vma */ > void __user *virtual_address; /* Faulting virtual address */ > > + struct page *cow_page; /* Handler may choose to COW */ > struct page *page; /* ->fault handlers should return a > * page here, unless VM_FAULT_NOPAGE > * is set (which is also implied by > @@ -1010,6 +1011,7 @@ static inline int page_mapped(struct page *page) > #define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned small page */ > #define VM_FAULT_HWPOISON_LARGE 0x0020 /* Hit poisoned large page. Index encoded in upper bits */ > > +#define VM_FAULT_COWED 0x0080 /* ->fault COWed the page instead */ > #define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */ > #define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */ > #define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */ > diff --git a/mm/memory.c b/mm/memory.c > index 07b4287..2a2ecac 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2602,6 +2602,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, > vmf.pgoff = page->index; > vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; > vmf.page = page; > + vmf.cow_page = NULL; > > ret = vma->vm_ops->page_mkwrite(vma, &vmf); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) > @@ -3288,7 +3289,8 @@ oom: > } > > static int __do_fault(struct vm_area_struct *vma, unsigned long address, > - pgoff_t pgoff, unsigned int flags, struct page **page) > + pgoff_t pgoff, unsigned int flags, > + struct page *cow_page, struct page **page) > { > struct vm_fault vmf; > int ret; > @@ -3297,10 +3299,13 @@ static int 
__do_fault(struct vm_area_struct *vma, unsigned long address, > vmf.pgoff = pgoff; > vmf.flags = flags; > vmf.page = NULL; > + vmf.cow_page = cow_page; > > ret = vma->vm_ops->fault(vma, &vmf); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) > return ret; > + if (unlikely(ret & VM_FAULT_COWED)) > + goto out; > > if (unlikely(PageHWPoison(vmf.page))) { > if (ret & VM_FAULT_LOCKED) > @@ -3314,6 +3319,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, > else > VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page); > > + out: > *page = vmf.page; > return ret; > } > @@ -3351,7 +3357,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma, > pte_t *pte; > int ret; > > - ret = __do_fault(vma, address, pgoff, flags, &fault_page); > + ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) > return ret; > > @@ -3368,6 +3374,12 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma, > return ret; > } > > +/* > + * If the fault handler performs the COW, it does not return a page, > + * so cannot use the page's lock to protect against a concurrent truncate > + * operation. Instead it returns with the i_mmap_mutex held, which must > + * be released after the PTE has been inserted. 
> + */ > static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, pmd_t *pmd, > pgoff_t pgoff, unsigned int flags, pte_t orig_pte) > @@ -3389,25 +3401,34 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, > return VM_FAULT_OOM; > } > > - ret = __do_fault(vma, address, pgoff, flags, &fault_page); > + ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) > goto uncharge_out; > > - copy_user_highpage(new_page, fault_page, address, vma); > + if (!(ret & VM_FAULT_COWED)) > + copy_user_highpage(new_page, fault_page, address, vma); > __SetPageUptodate(new_page); > > pte = pte_offset_map_lock(mm, pmd, address, &ptl); > - if (unlikely(!pte_same(*pte, orig_pte))) { > - pte_unmap_unlock(pte, ptl); > + if (unlikely(!pte_same(*pte, orig_pte))) > + goto unlock_out; > + do_set_pte(vma, address, new_page, pte, true, true); > + pte_unmap_unlock(pte, ptl); > + if (ret & VM_FAULT_COWED) { > + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); > + } else { > unlock_page(fault_page); > page_cache_release(fault_page); > - goto uncharge_out; > } > - do_set_pte(vma, address, new_page, pte, true, true); > - pte_unmap_unlock(pte, ptl); > - unlock_page(fault_page); > - page_cache_release(fault_page); > return ret; > +unlock_out: > + pte_unmap_unlock(pte, ptl); > + if (ret & VM_FAULT_COWED) { > + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); > + } else { > + unlock_page(fault_page); > + page_cache_release(fault_page); > + } > uncharge_out: > mem_cgroup_uncharge_page(new_page); > page_cache_release(new_page); > @@ -3424,7 +3445,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma, > int dirtied = 0; > int ret, tmp; > > - ret = __do_fault(vma, address, pgoff, flags, &fault_page); > + ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE 
| VM_FAULT_RETRY))) > return ret; > > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933039AbaDHR4I (ORCPT ); Tue, 8 Apr 2014 13:56:08 -0400 Received: from cantor2.suse.de ([195.135.220.15]:39306 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933005AbaDHR4E (ORCPT ); Tue, 8 Apr 2014 13:56:04 -0400 Date: Tue, 8 Apr 2014 19:56:00 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Message-ID: <20140408175600.GE2713@quack.suse.cz> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:32, Matthew Wilcox wrote: > Use the generic AIO infrastructure instead of custom read and write > methods. In addition to giving us support for AIO, this adds the missing > locking between read() and truncate(). > > Signed-off-by: Matthew Wilcox > Reviewed-by: Ross Zwisler In general this looks fine but I have some comments below. 
> --- > fs/Makefile | 1 + > fs/dax.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++ > fs/ext2/file.c | 6 +- > fs/ext2/inode.c | 7 +- > include/linux/fs.h | 18 ++++- > mm/filemap.c | 6 +- > mm/filemap_xip.c | 234 ----------------------------------------------------- > 7 files changed, 243 insertions(+), 245 deletions(-) > create mode 100644 fs/dax.c > > diff --git a/fs/Makefile b/fs/Makefile > index 47ac07b..2f194cd 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o > obj-$(CONFIG_TIMERFD) += timerfd.o > obj-$(CONFIG_EVENTFD) += eventfd.o > obj-$(CONFIG_AIO) += aio.o > +obj-$(CONFIG_FS_XIP) += dax.o > obj-$(CONFIG_FILE_LOCKING) += locks.o > obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o > obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o > diff --git a/fs/dax.c b/fs/dax.c > new file mode 100644 > index 0000000..66a6bda > --- /dev/null > +++ b/fs/dax.c > @@ -0,0 +1,216 @@ > +/* > + * fs/dax.c - Direct Access filesystem code > + * Copyright (c) 2013-2014 Intel Corporation > + * Author: Matthew Wilcox > + * Author: Ross Zwisler > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. 
> + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +static long dax_get_addr(struct inode *inode, struct buffer_head *bh, > + void **addr) > +{ > + struct block_device *bdev = bh->b_bdev; > + const struct block_device_operations *ops = bdev->bd_disk->fops; > + unsigned long pfn; > + sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); > + return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size); > +} > + > +static void dax_new_buf(void *addr, unsigned size, unsigned first, > + loff_t offset, loff_t end, int rw) > +{ > + loff_t final = end - offset + first; /* The final byte of the buffer */ > + if (rw != WRITE) { > + memset(addr, 0, size); > + return; > + } It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do this for unwritten blocks) when reading from them. Presumably it could also have undesired effects on endurance of persistent memory. Instead I'd expect that you simply zero out user provided buffer the same way as you do it for holes. > + > + if (first > 0) > + memset(addr, 0, first); > + if (final < size) > + memset(addr + final, 0, size - final); > +} > + > +static bool buffer_written(struct buffer_head *bh) > +{ > + return buffer_mapped(bh) && !buffer_unwritten(bh); > +} > + > +/* > + * When ext4 encounters a hole, it likes to return without modifying the > + * buffer_head which means that we can't trust b_size. To cope with this, > + * we set b_state to 0 before calling get_block and, if any bit is set, we > + * know we can trust b_size. Unfortunate, really, since ext4 does know > + * precisely how long a hole is and would save us time calling get_block > + * repeatedly. Well, this is really a problem of get_blocks() returning the result in struct buffer_head which is used for input as well. I don't think it is actually ext4 specific. 
> + */ > +static bool buffer_size_valid(struct buffer_head *bh) > +{ > + return bh->b_state != 0; > +} > + > +static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov, > + loff_t start, loff_t end, get_block_t get_block, > + struct buffer_head *bh) > +{ > + ssize_t retval = 0; > + unsigned seg = 0; > + unsigned len; > + unsigned copied = 0; > + loff_t offset = start; > + loff_t max = start; > + loff_t bh_max = start; > + void *addr; > + bool hole = false; > + > + if (rw != WRITE) > + end = min(end, i_size_read(inode)); > + > + while (offset < end) { > + void __user *buf = iov[seg].iov_base + copied; > + > + if (offset == max) { > + sector_t block = offset >> inode->i_blkbits; > + unsigned first = offset - (block << inode->i_blkbits); > + long size; > + > + if (offset == bh_max) { > + bh->b_size = PAGE_ALIGN(end - offset); > + bh->b_state = 0; > + retval = get_block(inode, block, bh, > + rw == WRITE); > + if (retval) > + break; > + if (!buffer_size_valid(bh)) > + bh->b_size = 1 << inode->i_blkbits; > + bh_max = offset - first + bh->b_size; > + } else { > + unsigned done = bh->b_size - (bh_max - > + (offset - first)); > + bh->b_blocknr += done >> inode->i_blkbits; > + bh->b_size -= done; It took me quite some time to figure out what this does and whether it is correct :). Why isn't this at the place where we advance all other iterators like offset, addr, etc.? > + } > + if (rw == WRITE) { > + if (!buffer_mapped(bh)) { > + retval = -EIO; > + break; -EIO looks like a wrong error here. Or maybe it is the right one and it only needs some explanation? The thing is that for direct IO some filesystems choose not to fill holes for direct IO and fall back to buffered IO instead (to avoid exposure of uninitialized blocks if the system crashes after blocks have been added to a file but before they were written out). 
For DAX you are pretty much free to define what you ask from get_blocks() (and this fallback behavior is somewhat disputed in the direct IO case, so you might want to differ here), but you should document it somewhere. > + } > + hole = false; > + } else { > + hole = !buffer_written(bh); > + } > + > + if (hole) { > + addr = NULL; > + size = bh->b_size - first; > + } else { > + retval = dax_get_addr(inode, bh, &addr); > + if (retval < 0) > + break; > + if (buffer_unwritten(bh) || buffer_new(bh)) > + dax_new_buf(addr, retval, first, > + offset, end, rw); > + addr += first; > + size = retval - first; > + } > + max = min(offset + size, end); > + } > + > + len = min_t(unsigned, iov[seg].iov_len - copied, max - offset); > + > + if (rw == WRITE) > + len -= __copy_from_user_nocache(addr, buf, len); > + else if (!hole) > + len -= __copy_to_user(buf, addr, len); > + else > + len -= __clear_user(buf, len); > + > + if (!len) > + break; > + > + offset += len; > + copied += len; > + addr += len; > + if (copied == iov[seg].iov_len) { > + seg++; > + copied = 0; > + } > + } > + > + return (offset == start) ? retval : offset - start; > +} > + > +/** > + * dax_do_io - Perform I/O to a DAX file > + * @rw: READ to read or WRITE to write > + * @iocb: The control block for this I/O > + * @inode: The file which the I/O is directed at > + * @iov: The user addresses to do I/O from or to > + * @offset: The file offset where the I/O starts > + * @nr_segs: The length of the iov array > + * @get_block: The filesystem method used to translate file offsets to blocks > + * @end_io: A filesystem callback for I/O completion > + * @flags: See below > + * > + * This function uses the same locking scheme as do_blockdev_direct_IO: > + * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the > + * caller for writes. For reads, we take and release the i_mutex ourselves. > + * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
> + * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O > + * is in progress. > + */ > +ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, > + const struct iovec *iov, loff_t offset, unsigned nr_segs, > + get_block_t get_block, dio_iodone_t end_io, int flags) > +{ > + struct buffer_head bh; > + unsigned seg; > + ssize_t retval = -EINVAL; > + loff_t end = offset; > + > + memset(&bh, 0, sizeof(bh)); > + for (seg = 0; seg < nr_segs; seg++) > + end += iov[seg].iov_len; > + > + if ((flags & DIO_LOCKING) && (rw == READ)) { > + struct address_space *mapping = inode->i_mapping; > + mutex_lock(&inode->i_mutex); > + retval = filemap_write_and_wait_range(mapping, offset, end - 1); > + if (retval) { > + mutex_unlock(&inode->i_mutex); > + goto out; > + } Is there a reason for this? I'd assume DAX has no pages in pagecache... > + } > + > + /* Protects against truncate */ > + atomic_inc(&inode->i_dio_count); > + > + retval = dax_io(rw, inode, iov, offset, end, get_block, &bh); > + > + if ((flags & DIO_LOCKING) && (rw == READ)) > + mutex_unlock(&inode->i_mutex); > + > + inode_dio_done(inode); > + > + if ((retval > 0) && end_io) > + end_io(iocb, offset, retval, bh.b_private); > + out: > + return retval; > +} > +EXPORT_SYMBOL_GPL(dax_do_io); > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 44c36e5..ef5cf96 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = { > #ifdef CONFIG_EXT2_FS_XIP > const struct file_operations ext2_xip_file_operations = { > .llseek = generic_file_llseek, > - .read = xip_file_read, > - .write = xip_file_write, > + .read = do_sync_read, > + .write = do_sync_write, > + .aio_read = generic_file_aio_read, > + .aio_write = generic_file_aio_write, > .unlocked_ioctl = ext2_ioctl, > #ifdef CONFIG_COMPAT > .compat_ioctl = ext2_compat_ioctl, > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index e7d3192..f128ebf 100644 > --- a/fs/ext2/inode.c > +++ 
b/fs/ext2/inode.c > @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, > struct inode *inode = mapping->host; > ssize_t ret; > > - ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, > + if (IS_DAX(inode)) > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > + ext2_get_block, NULL, DIO_LOCKING); > + else > + ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, > ext2_get_block); I'd somewhat prefer to have a ext2_direct_IO() as is and have ext2_dax_IO() call only dax_do_io() (and use that as .direct_io in ext2_aops_xip). Then there's no need to check IS_DAX() and the code would look more obvious to me. But I don't feel strongly about it. > if (ret < 0 && (rw & WRITE)) > ext2_write_failed(mapping, offset + iov_length(iov, nr_segs)); > @@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = { > const struct address_space_operations ext2_aops_xip = { > .bmap = ext2_bmap, > .get_xip_mem = ext2_get_xip_mem, > + .direct_IO = ext2_direct_IO, > }; > > const struct address_space_operations ext2_nobh_aops = { > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 47fd219..dabc601 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2521,17 +2521,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp); > extern int nonseekable_open(struct inode * inode, struct file * filp); > > #ifdef CONFIG_FS_XIP > -extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len, > - loff_t *ppos); > extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma); > -extern ssize_t xip_file_write(struct file *filp, const char __user *buf, > - size_t len, loff_t *ppos); > extern int xip_truncate_page(struct address_space *mapping, loff_t from); > +ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, > + loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags); > #else > static inline int xip_truncate_page(struct 
address_space *mapping, loff_t from) > { > return 0; > } > + > +static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, > + const struct iovec *iov, loff_t offset, unsigned nr_segs, > + get_block_t get_block, dio_iodone_t end_io, int flags) > +{ > + return -ENOTTY; Huh, ENOTTY? I'd expect EOPNOTSUPP or something like that... > +} > #endif > > #ifdef CONFIG_BLOCK > @@ -2681,6 +2686,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root); > extern void save_mount_options(struct super_block *sb, char *options); > extern void replace_mount_options(struct super_block *sb, char *options); > > +static inline bool io_is_direct(struct file *filp) > +{ > + return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp)); > +} > + BTW: It seems fs/open.c: open_check_o_direct() can be simplified to not check for get_xip_mem(), can't it? > static inline ino_t parent_ino(struct dentry *dentry) > { > ino_t res; > diff --git a/mm/filemap.c b/mm/filemap.c > index 7a13f6a..1b7dff6 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, > if (retval) > return retval; > > - /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ > - if (filp->f_flags & O_DIRECT) { > + if (io_is_direct(filp)) { > loff_t size; > struct address_space *mapping; > struct inode *inode; > @@ -2468,8 +2467,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, > if (err) > goto out; > > - /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ > - if (unlikely(file->f_flags & O_DIRECT)) { > + if (io_is_direct(file)) { > loff_t endbyte; > ssize_t written_buffered; > Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757701AbaDHSVG (ORCPT ); Tue, 8 Apr 2014 14:21:06 -0400 Received: from cantor2.suse.de ([195.135.220.15]:39607 "EHLO
mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756894AbaDHSVC (ORCPT ); Tue, 8 Apr 2014 14:21:02 -0400 Date: Tue, 8 Apr 2014 20:20:59 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 10/22] Remove get_xip_mem Message-ID: <20140408182059.GA26019@quack.suse.cz> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:36, Matthew Wilcox wrote: > All callers of get_xip_mem() are now gone. Remove checks for it, > initialisers of it, documentation of it and the only implementation of it. > > Add documentation for writing a filesystem that supports DAX. > > Signed-off-by: Matthew Wilcox > Reviewed-by: Randy Dunlap The patch looks good. You can add: Reviewed-by: Jan Kara Honza > --- > Documentation/filesystems/Locking | 3 -- > Documentation/filesystems/dax.txt | 82 +++++++++++++++++++++++++++++++++++++++ > Documentation/filesystems/xip.txt | 71 --------------------------------- > fs/exofs/inode.c | 1 - > fs/ext2/inode.c | 1 - > fs/ext2/xip.c | 37 ------------------ > fs/ext2/xip.h | 3 -- > fs/open.c | 5 +-- > include/linux/fs.h | 2 - > mm/fadvise.c | 6 ++- > mm/madvise.c | 2 +- > 11 files changed, 88 insertions(+), 125 deletions(-) > create mode 100644 Documentation/filesystems/dax.txt > delete mode 100644 Documentation/filesystems/xip.txt > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking > index 5b0c083..2780d47 100644 > --- a/Documentation/filesystems/Locking > +++ b/Documentation/filesystems/Locking > @@ -194,8 +194,6 @@ prototypes: > void (*freepage)(struct page *); > int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > - int 
(*get_xip_mem)(struct address_space *, pgoff_t, int, void **, > - unsigned long *); > int (*migratepage)(struct address_space *, struct page *, struct page *); > int (*launder_page)(struct page *); > int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long); > @@ -220,7 +218,6 @@ invalidatepage: yes > releasepage: yes > freepage: yes > direct_IO: > -get_xip_mem: maybe > migratepage: yes (both) > launder_page: yes > is_partially_uptodate: yes > diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt > new file mode 100644 > index 0000000..06f84e5 > --- /dev/null > +++ b/Documentation/filesystems/dax.txt > @@ -0,0 +1,82 @@ > +Execute-in-place for file mappings > +---------------------------------- > + > +Motivation > +---------- > + > +File mappings are usually performed by mapping page cache pages to > +userspace. In addition, read & write file operations also transfer data > +between the page cache and storage. > + > +For memory backed storage devices that use the block device interface, > +the page cache pages are just copies of the original storage. The > +execute-in-place code removes the extra copy by performing reads and > +writes directly on the memory backed storage device. For file mappings, > +the storage device itself is mapped directly into userspace. > + > + > +Implementation Tips for Block Driver Writers > +-------------------------------------------- > + > +To support DAX in your block driver, implement the 'direct_access' > +block device operation. It is used to translate the sector number > +(expressed in units of 512-byte sectors) to a page frame number (pfn) > +that identifies the physical page for the memory. It also returns a > +kernel virtual address that can be used to access the memory. > + > +The direct_access method takes a 'size' parameter that indicates the > +number of bytes being requested. 
The function should return the number > +of bytes that it can provide, although it must not exceed the number of > +bytes requested. It may also return a negative errno if an error occurs. > + > +In order to support this method, the storage must be byte-accessible by > +the CPU at all times. If your device uses paging techniques to expose > +a large amount of memory through a smaller window, then you cannot > +implement direct_access. Equally, if your device can occasionally > +stall the CPU for an extended period, you should also not attempt to > +implement direct_access. > + > +These block devices may be used for inspiration: > +- axonram: Axon DDR2 device driver > +- brd: RAM backed block device driver > +- dcssblk: s390 dcss block device driver > + > + > +Implementation Tips for Filesystem Writers > +------------------------------------------ > + > +Filesystem support consists of > +- adding support to mark inodes as being DAX by setting the S_DAX flag in > + i_flags > +- implementing the direct_IO address space operation, and calling > + dax_do_io() instead of blockdev_direct_IO() if S_DAX is set > +- implementing an mmap file operation for DAX files which sets the > + VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers > + for fault and page_mkwrite (which should probably call dax_fault() and > + dax_mkwrite(), passing the appropriate get_block() callback) > +- calling dax_truncate_page() instead of block_truncate_page() for DAX files > +- ensuring that there is sufficient locking between reads, writes, > + truncates and page faults > + > +The get_block() callback passed to the DAX functions may return > +uninitialised extents. If it does, it must ensure that simultaneous > +calls to get_block() (for example by a page-fault racing with a read() > +or a write()) work correctly. 
> + > +These filesystems may be used for inspiration: > +- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt > + > + > +Shortcomings > +------------ > + > +Even if the kernel or its modules are stored on a filesystem that supports > +DAX on a block device that supports DAX, they will still be copied into RAM. > + > +Calling get_user_pages() on a range of user memory that has been mmaped > +from a DAX file will fail as there are no 'struct page' to describe > +those pages. This problem is being worked on. That means that O_DIRECT > +reads/writes to those memory ranges from a non-DAX file will fail (note > +that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory > +that is being accessed that is key here). Other things that will not > +work include RDMA, sendfile() and splice(). > diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt > deleted file mode 100644 > index b62eabf..0000000 > --- a/Documentation/filesystems/xip.txt > +++ /dev/null > @@ -1,71 +0,0 @@ > -Execute-in-place for file mappings > ----------------------------------- > - > -Motivation > ----------- > -File mappings are performed by mapping page cache pages to userspace. In > -addition, read&write type file operations also transfer data from/to the page > -cache. > - > -For memory backed storage devices that use the block device interface, the page > -cache pages are in fact copies of the original storage. Various approaches > -exist to work around the need for an extra copy. The ramdisk driver for example > -does read the data into the page cache, keeps a reference, and discards the > -original data behind later on. > - > -Execute-in-place solves this issue the other way around: instead of keeping > -data in the page cache, the need to have a page cache copy is eliminated > -completely. With execute-in-place, read&write type operations are performed > -directly from/to the memory backed storage device. 
For file mappings, the > -storage device itself is mapped directly into userspace. > - > -This implementation was initially written for shared memory segments between > -different virtual machines on s390 hardware to allow multiple machines to > -share the same binaries and libraries. > - > -Implementation > --------------- > -Execute-in-place is implemented in three steps: block device operation, > -address space operation, and file operations. > - > -A block device operation named direct_access is used to translate the > -block device sector number to a page frame number (pfn) that identifies > -the physical page for the memory. It also returns a kernel virtual > -address that can be used to access the memory. > - > -The direct_access method takes a 'size' parameter that indicates the > -number of bytes being requested. The function should return the number > -of bytes that it can provide, although it must not exceed the number of > -bytes requested. It may also return a negative errno if an error occurs. > - > -The block device operation is optional, these block devices support it as of > -today: > -- dcssblk: s390 dcss block device driver > - > -An address space operation named get_xip_mem is used to retrieve references > -to a page frame number and a kernel address. To obtain these values a reference > -to an address_space is provided. This function assigns values to the kmem and > -pfn parameters. The third argument indicates whether the function should allocate > -blocks if needed. > - > -This address space operation is mutually exclusive with readpage&writepage that > -do page cache read/write operations. > -The following filesystems support it as of today: > -- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt > - > -A set of file operations that do utilize get_xip_page can be found in > -mm/filemap_xip.c . 
The following file operation implementations are provided: > -- aio_read/aio_write > -- readv/writev > -- sendfile > - > -The generic file operations do_sync_read/do_sync_write can be used to implement > -classic synchronous IO calls. > - > -Shortcomings > ------------- > -This implementation is limited to storage devices that are cpu addressable at > -all times (no highmem or such). It works well on rom/ram, but enhancements are > -needed to make it work with flash in read+write mode. > -Putting the Linux kernel and/or its modules on a xip filesystem does not mean > -they are not copied. > diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c > index ee4317fa..f9a5bf6 100644 > --- a/fs/exofs/inode.c > +++ b/fs/exofs/inode.c > @@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = { > .direct_IO = exofs_direct_IO, > > /* With these NULL has special meaning or default is not exported */ > - .get_xip_mem = NULL, > .migratepage = NULL, > .launder_page = NULL, > .is_partially_uptodate = NULL, > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 252481f..b156fe8 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = { > > const struct address_space_operations ext2_aops_xip = { > .bmap = ext2_bmap, > - .get_xip_mem = ext2_get_xip_mem, > .direct_IO = ext2_direct_IO, > }; > > diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c > index fa40091..ca745ff 100644 > --- a/fs/ext2/xip.c > +++ b/fs/ext2/xip.c > @@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block, > return ops->direct_access(bdev, sector, kaddr, pfn, size); > } > > -static inline int > -__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create, > - sector_t *result) > -{ > - struct buffer_head tmp; > - int rc; > - > - memset(&tmp, 0, sizeof(struct buffer_head)); > - tmp.b_size = 1 << inode->i_blkbits; > - rc = ext2_get_block(inode, pgoff, &tmp, create); > - *result = tmp.b_blocknr; > - > - 
/* did we get a sparse block (hole in the file)? */ > - if (!tmp.b_blocknr && !rc) { > - BUG_ON(create); > - rc = -ENODATA; > - } > - > - return rc; > -} > - > int > ext2_clear_xip_target(struct inode *inode, sector_t block) > { > @@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb) > "not supported by bdev"); > } > } > - > -int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, > - void **kmem, unsigned long *pfn) > -{ > - long rc; > - sector_t block; > - > - /* first, retrieve the sector number */ > - rc = __ext2_get_block(mapping->host, pgoff, create, &block); > - if (rc) > - return rc; > - > - /* retrieve address of the target data */ > - rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE); > - return (rc < 0) ? rc : 0; > -} > diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h > index 29be737..0fa8b7f 100644 > --- a/fs/ext2/xip.h > +++ b/fs/ext2/xip.h > @@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb) > struct ext2_sb_info *sbi = EXT2_SB(sb); > return (sbi->s_mount_opt & EXT2_MOUNT_XIP); > } > -int ext2_get_xip_mem(struct address_space *, pgoff_t, int, > - void **, unsigned long *); > #else > #define ext2_xip_verify_sb(sb) do { } while (0) > #define ext2_use_xip(sb) 0 > #define ext2_clear_xip_target(inode, chain) 0 > -#define ext2_get_xip_mem NULL > #endif > diff --git a/fs/open.c b/fs/open.c > index b9ed8b2..bc9f002 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f) > { > /* NB: we're sure to have correct a_ops only after f_op->open */ > if (f->f_flags & O_DIRECT) { > - if (!f->f_mapping->a_ops || > - ((!f->f_mapping->a_ops->direct_IO) && > - (!f->f_mapping->a_ops->get_xip_mem))) { > + if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO) > return -EINVAL; > - } > } > return 0; > } > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9752ae5..c777056 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ 
-375,8 +375,6 @@ struct address_space_operations { > void (*freepage)(struct page *); > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, > loff_t offset, unsigned long nr_segs); > - int (*get_xip_mem)(struct address_space *, pgoff_t, int, > - void **, unsigned long *); > /* > * migrate the contents of a page to the specified target. If > * migrate_mode is MIGRATE_ASYNC, it must not block. > diff --git a/mm/fadvise.c b/mm/fadvise.c > index 3bcfd81..1f1925f 100644 > --- a/mm/fadvise.c > +++ b/mm/fadvise.c > @@ -28,6 +28,7 @@ > SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) > { > struct fd f = fdget(fd); > + struct inode *inode; > struct address_space *mapping; > struct backing_dev_info *bdi; > loff_t endbyte; /* inclusive */ > @@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) > if (!f.file) > return -EBADF; > > - if (S_ISFIFO(file_inode(f.file)->i_mode)) { > + inode = file_inode(f.file); > + if (S_ISFIFO(inode->i_mode)) { > ret = -ESPIPE; > goto out; > } > @@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) > goto out; > } > > - if (mapping->a_ops->get_xip_mem) { > + if (IS_DAX(inode)) { > switch (advice) { > case POSIX_FADV_NORMAL: > case POSIX_FADV_RANDOM: > diff --git a/mm/madvise.c b/mm/madvise.c > index 539eeb9..b6a2f52 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma, > if (!file) > return -EBADF; > > - if (file->f_mapping->a_ops->get_xip_mem) { > + if (IS_DAX(file_inode(file))) { > /* no bad return value, but ignore advice */ > return 0; > } > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757869AbaDHSVm (ORCPT ); Tue, 8 Apr 2014 14:21:42 -0400 Received: from cantor2.suse.de ([195.135.220.15]:39619 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757707AbaDHSVg (ORCPT ); Tue, 8 Apr 2014 14:21:36 -0400 Date: Tue, 8 Apr 2014 20:21:35 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 09/22] Remove mm/filemap_xip.c Message-ID: <20140408182135.GB26019@quack.suse.cz> References: <69ab315f0124881ae74d9881c48c7bdc70368fd1.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <69ab315f0124881ae74d9881c48c7bdc70368fd1.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:35, Matthew Wilcox wrote: > It is now empty as all of its contents have been replaced by fs/xip.c Looks good. 
You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > mm/Makefile | 1 - > mm/filemap_xip.c | 23 ----------------------- > 2 files changed, 24 deletions(-) > delete mode 100644 mm/filemap_xip.c > > diff --git a/mm/Makefile b/mm/Makefile > index 310c90a..454c176 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o > obj-$(CONFIG_KMEMCHECK) += kmemcheck.o > obj-$(CONFIG_FAILSLAB) += failslab.o > obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o > -obj-$(CONFIG_FS_XIP) += filemap_xip.o > obj-$(CONFIG_MIGRATION) += migrate.o > obj-$(CONFIG_QUICKLIST) += quicklist.o > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c > deleted file mode 100644 > index 6316578..0000000 > --- a/mm/filemap_xip.c > +++ /dev/null > @@ -1,23 +0,0 @@ > -/* > - * linux/mm/filemap_xip.c > - * > - * Copyright (C) 2005 IBM Corporation > - * Author: Carsten Otte > - * > - * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds > - * > - */ > - > -#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > -#include > - > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757915AbaDHWFc (ORCPT ); Tue, 8 Apr 2014 18:05:32 -0400 Received: from cantor2.suse.de ([195.135.220.15]:43382 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757295AbaDHWFa (ORCPT ); Tue, 8 Apr 2014 18:05:30 -0400 Date: Wed, 9 Apr 2014 00:05:25 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, 
willy@linux.intel.com Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140408220525.GC26019@quack.suse.cz> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:33, Matthew Wilcox wrote: > Instead of calling aops->get_xip_mem from the fault handler, the > filesystem passes a get_block_t that is used to find the appropriate > blocks. I have some suggestions below... > Signed-off-by: Matthew Wilcox > --- > fs/dax.c | 207 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > fs/ext2/file.c | 35 ++++++++- > include/linux/fs.h | 4 +- > mm/filemap_xip.c | 206 ---------------------------------------------------- > 4 files changed, 243 insertions(+), 209 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 66a6bda..863749c 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -19,8 +19,12 @@ > #include > #include > #include > +#include > +#include > +#include > #include > #include > +#include > > static long dax_get_addr(struct inode *inode, struct buffer_head *bh, > void **addr) > @@ -32,6 +36,16 @@ static long dax_get_addr(struct inode *inode, struct buffer_head *bh, > return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size); > } > > +static long dax_get_pfn(struct inode *inode, struct buffer_head *bh, > + unsigned long *pfn) > +{ > + struct block_device *bdev = bh->b_bdev; > + const struct block_device_operations *ops = bdev->bd_disk->fops; > + void *addr; > + sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); > + return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size); > +} > + > static void dax_new_buf(void *addr, unsigned size, unsigned first, > loff_t offset, loff_t end, int rw) > { > @@ -214,3 +228,196 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode 
*inode, > return retval; > } > EXPORT_SYMBOL_GPL(dax_do_io); > + > +/* > + * The user has performed a load from a hole in the file. Allocating > + * a new page in the file would cause excessive storage usage for > + * workloads with sparse files. We allocate a page cache page instead. > + * We'll kick it out of the page cache if it's ever written to, > + * otherwise it will simply fall out of the page cache under memory > + * pressure without ever having been dirtied. > + */ > +static int dax_load_hole(struct address_space *mapping, struct page *page, > + struct vm_fault *vmf) > +{ > + unsigned long size; > + struct inode *inode = mapping->host; > + if (!page) > + page = find_or_create_page(mapping, vmf->pgoff, > + GFP_KERNEL | __GFP_ZERO); > + if (!page) > + return VM_FAULT_OOM; > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (vmf->pgoff >= size) { Maybe comment here that we have to recheck i_size so that we don't create pages in the area truncate_pagecache() has already evicted. 
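The check being discussed uses the same rounding as the quoted patch: the file size in pages is i_size rounded up to a whole page, and a fault at or beyond that page index lies outside the file. A minimal userspace sketch of that arithmetic follows, assuming a 4096-byte page. SKETCH_PAGE_SHIFT, size_in_pages() and fault_beyond_eof() are made-up names for illustration, not kernel symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page geometry for this sketch only (4096-byte pages). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Number of pages needed to cover i_size bytes, rounded up. */
static uint64_t size_in_pages(uint64_t i_size)
{
	return (i_size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* True when a fault at page index pgoff falls at or past end of file,
 * which is the case the patch answers with VM_FAULT_SIGBUS. */
bool fault_beyond_eof(uint64_t pgoff, uint64_t i_size)
{
	return pgoff >= size_in_pages(i_size);
}
```

The arithmetic is trivial; the point of the review comment is that the result has to be evaluated again after the page has been found or created, because a truncate may have shrunk i_size in between.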
> + unlock_page(page); > + page_cache_release(page); > + return VM_FAULT_SIGBUS; > + } > + > + vmf->page = page; > + return VM_FAULT_LOCKED; > +} > + > +static void copy_user_bh(struct page *to, struct inode *inode, > + struct buffer_head *bh, unsigned long vaddr) > +{ > + void *vfrom, *vto; > + dax_get_addr(inode, bh, &vfrom); /* XXX: error handling */ The error handling here is missing as the comment suggests :) > + vto = kmap_atomic(to); > + copy_user_page(vto, vfrom, vaddr, to); > + kunmap_atomic(vto); > +} > + > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ > + struct file *file = vma->vm_file; > + struct inode *inode = file_inode(file); > + struct address_space *mapping = file->f_mapping; > + struct page *page; > + struct buffer_head bh; > + unsigned long vaddr = (unsigned long)vmf->virtual_address; > + sector_t block; > + pgoff_t size; > + unsigned long pfn; > + int error; > + int major = 0; > + > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (vmf->pgoff >= size) > + return VM_FAULT_SIGBUS; > + > + memset(&bh, 0, sizeof(bh)); > + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits); > + bh.b_size = PAGE_SIZE; > + > + repeat: > + page = find_get_page(mapping, vmf->pgoff); > + if (page) { > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { > + page_cache_release(page); > + return VM_FAULT_RETRY; > + } > + if (unlikely(page->mapping != mapping)) { > + unlock_page(page); > + page_cache_release(page); > + goto repeat; > + } > + } > + > + error = get_block(inode, block, &bh, 0); > + if (error || bh.b_size < PAGE_SIZE) > + goto sigbus; > + > + if (!buffer_written(&bh) && !vmf->cow_page) { > + if (vmf->flags & FAULT_FLAG_WRITE) { > + error = get_block(inode, block, &bh, 1); > + count_vm_event(PGMAJFAULT); > + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); > + major = VM_FAULT_MAJOR; > + if (error || bh.b_size < PAGE_SIZE) > + goto sigbus; > + } else { > + 
return dax_load_hole(mapping, page, vmf); > + } > + } > + > + /* Recheck i_size under i_mmap_mutex */ > + mutex_lock(&mapping->i_mmap_mutex); > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (unlikely(vmf->pgoff >= size)) { > + mutex_unlock(&mapping->i_mmap_mutex); > + goto sigbus; > + } > + if (vmf->cow_page) { > + if (buffer_written(&bh)) > + copy_user_bh(vmf->cow_page, inode, &bh, vaddr); > + else > + clear_user_highpage(vmf->cow_page, vaddr); > + if (page) { > + unlock_page(page); > + page_cache_release(page); > + } > + /* do_cow_fault() will release the i_mmap_mutex */ > + return VM_FAULT_COWED; > + } > + > + if (buffer_unwritten(&bh) || buffer_new(&bh)) > + dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); Where is dax_clear_blocks() defined? > + > + error = dax_get_pfn(inode, &bh, &pfn); > + if (error > 0) > + error = vm_insert_mixed(vma, vaddr, pfn); When there's a hole (thus page != NULL) and we are called from dax_mkwrite(), this will always return EBUSY, correct? > + mutex_unlock(&mapping->i_mmap_mutex); > + > + if (page) { > + delete_from_page_cache(page); > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > + PAGE_CACHE_SIZE, 0); Here we unmap the PTE pointing to the hole page but then we'll have to retry the fault again to fill in the pfn we've got? This seems wrong. I'd say we want to remap the PTE from the hole page to a pfn we've got while holding i_mmap_mutex. remap_pfn_range() almost does what you need, except that you also need that to work for normal pages. So you might need to create a new helper in mm layer for that. 
> + unlock_page(page); > + page_cache_release(page); > + } > + > + if (error == -ENOMEM) > + return VM_FAULT_OOM; > + /* -EBUSY is fine, somebody else faulted on the same PTE */ > + if (error != -EBUSY) > + BUG_ON(error); > + return VM_FAULT_NOPAGE | major; > + > + sigbus: > + if (page) { > + unlock_page(page); > + page_cache_release(page); > + } > + return VM_FAULT_SIGBUS; > +} > + > +/** > + * dax_fault - handle a page fault on an XIP file > + * @vma: The virtual memory area where the fault occurred > + * @vmf: The description of the fault > + * @get_block: The filesystem method used to translate file offsets to blocks > + * > + * When a page fault occurs, filesystems may call this helper in their > + * fault handler for XIP files. > + */ > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ > + int result; > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > + > + sb_start_pagefault(sb); You don't need any filesystem freeze protection for the fault handler since that's not going to modify the filesystem. > + file_update_time(vma->vm_file); Why do you update m/ctime? We are only reading the file... > + result = do_dax_fault(vma, vmf, get_block); > + sb_end_pagefault(sb); > + > + return result; > +} > +EXPORT_SYMBOL_GPL(dax_fault); > + > +/** > + * dax_mkwrite - convert a read-only page to read-write in an XIP file > + * @vma: The virtual memory area where the fault occurred > + * @vmf: The description of the fault > + * @get_block: The filesystem method used to translate file offsets to blocks > + * > + * XIP handles reads of holes by adding pages full of zeroes into the > + * mapping. If the page is subsequenty written to, we have to allocate > + * the page on media and free the page that was in the cache. 
> + */ > +int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ > + int result; > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > + > + sb_start_pagefault(sb); > + file_update_time(vma->vm_file); > + result = do_dax_fault(vma, vmf, get_block); > + sb_end_pagefault(sb); > + > + return result; > +} > +EXPORT_SYMBOL_GPL(dax_mkwrite); > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index ef5cf96..e3ce10d 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -25,6 +25,37 @@ > #include "xattr.h" > #include "acl.h" > > +#ifdef CONFIG_EXT2_FS_XIP > +static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return dax_fault(vma, vmf, ext2_get_block); > +} > + > +static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return dax_mkwrite(vma, vmf, ext2_get_block); > +} > + > +static const struct vm_operations_struct ext2_dax_vm_ops = { > + .fault = ext2_dax_fault, > + .page_mkwrite = ext2_dax_mkwrite, > + .remap_pages = generic_file_remap_pages, > +}; > + > +static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + if (!IS_DAX(file_inode(file))) > + return generic_file_mmap(file, vma); > + > + file_accessed(file); > + vma->vm_ops = &ext2_dax_vm_ops; > + vma->vm_flags |= VM_MIXEDMAP; > + return 0; > +} > +#else > +#define ext2_file_mmap generic_file_mmap > +#endif > + > /* > * Called when filp is released. This happens when all file descriptors > * for a single struct file are closed. Note that different open() calls > @@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = { > #ifdef CONFIG_COMPAT > .compat_ioctl = ext2_compat_ioctl, > #endif > - .mmap = generic_file_mmap, > + .mmap = ext2_file_mmap, So what's the point of ext2_file_operations ever handling IS_DAX() inodes? 
Actually ext2_file_operations and ext2_xip_file_operations seem to be the same after this patch so either you drop ext2_xip_file_operations (I'm for this) or you can leave generic_file_mmap here and assume ext2_file_mmap is always called for IS_DAX() inodes. > .open = dquot_file_open, > .release = ext2_release_file, > .fsync = ext2_fsync, > @@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = { > #ifdef CONFIG_COMPAT > .compat_ioctl = ext2_compat_ioctl, > #endif > - .mmap = xip_file_mmap, > + .mmap = ext2_file_mmap, > .open = dquot_file_open, > .release = ext2_release_file, > .fsync = ext2_fsync, Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757675AbaDHWSG (ORCPT ); Tue, 8 Apr 2014 18:18:06 -0400 Received: from cantor2.suse.de ([195.135.220.15]:43527 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750789AbaDHWSC (ORCPT ); Tue, 8 Apr 2014 18:18:02 -0400 Date: Wed, 9 Apr 2014 00:17:59 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page Message-ID: <20140408221759.GD26019@quack.suse.cz> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:34, Matthew Wilcox wrote: > It takes a get_block parameter just like nobh_truncate_page() and > block_truncate_page() The patch looks mostly OK. Some minor comments below. 
> > Signed-off-by: Matthew Wilcox > --- > fs/dax.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++---- > fs/ext2/inode.c | 2 +- > include/linux/fs.h | 4 ++-- > mm/filemap_xip.c | 40 ---------------------------------------- > 4 files changed, 51 insertions(+), 47 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 863749c..7271be0 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -374,13 +374,13 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > } > > /** > - * dax_fault - handle a page fault on an XIP file > + * dax_fault - handle a page fault on a DAX file > * @vma: The virtual memory area where the fault occurred > * @vmf: The description of the fault > * @get_block: The filesystem method used to translate file offsets to blocks > * > * When a page fault occurs, filesystems may call this helper in their > - * fault handler for XIP files. > + * fault handler for DAX files. > */ > int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > get_block_t get_block) > @@ -398,12 +398,12 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > EXPORT_SYMBOL_GPL(dax_fault); > > /** > - * dax_mkwrite - convert a read-only page to read-write in an XIP file > + * dax_mkwrite - convert a read-only page to read-write in a DAX file > * @vma: The virtual memory area where the fault occurred > * @vmf: The description of the fault > * @get_block: The filesystem method used to translate file offsets to blocks > * > - * XIP handles reads of holes by adding pages full of zeroes into the > + * DAX handles reads of holes by adding pages full of zeroes into the > * mapping. If the page is subsequenty written to, we have to allocate > * the page on media and free the page that was in the cache. > */ Above two hunks belong to the previous patch... 
> @@ -421,3 +421,47 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, > return result; > } > EXPORT_SYMBOL_GPL(dax_mkwrite); > + > +/** > + * dax_truncate_page - handle a partial page being truncated in a DAX file > + * @inode: The file being truncated > + * @from: The file offset that is being truncated to > + * @get_block: The filesystem method used to translate file offsets to blocks > + * > + * Similar to block_truncate_page(), this function can be called by a > + * filesystem when it is truncating an DAX file to handle the partial page. > + * > + * We work in terms of PAGE_CACHE_SIZE here for commonality with > + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem > + * took care of disposing of the unnecessary blocks. Even if the filesystem > + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page > + * since the file might be mmaped. Well, DAX mmap support pretty much relies on PAGE_CACHE_SIZE == block size (we cannot really map only a part of a physical page directly...). So the comment seems somewhat misleading. > + */ > +int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) > +{ > + struct buffer_head bh; > + pgoff_t index = from >> PAGE_CACHE_SHIFT; > + unsigned offset = from & (PAGE_CACHE_SIZE-1); > + unsigned length = PAGE_CACHE_ALIGN(from) - from; > + int err; > + Can we WARN_ON_ONCE here if PAGE_CACHE_SHIFT != inode->i_blkbits? Just to catch bugs early. > + /* Block boundary? 
Nothing to do */ > + if (!length) > + return 0; > + > + memset(&bh, 0, sizeof(bh)); > + bh.b_size = PAGE_CACHE_SIZE; > + err = get_block(inode, index, &bh, 0); > + if (err < 0) > + return err; > + if (buffer_written(&bh)) { > + void *addr; > + err = dax_get_addr(inode, &bh, &addr); > + if (err) > + return err; > + memset(addr + offset, 0, length); > + } > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(dax_truncate_page); Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758041AbaDHXRW (ORCPT ); Tue, 8 Apr 2014 19:17:22 -0400 Received: from mga09.intel.com ([134.134.136.24]:20573 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756739AbaDHXRU (ORCPT ); Tue, 8 Apr 2014 19:17:20 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,820,1389772800"; d="scan'208";a="489455611" Date: Tue, 8 Apr 2014 16:21:02 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Message-ID: <20140408202102.GB5727@linux.intel.com> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140408175600.GE2713@quack.suse.cz> User-Agent: Mutt/1.5.22 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote: > > +static void dax_new_buf(void *addr, unsigned size, unsigned first, > > + loff_t offset, loff_t end, int rw) > > +{ > > + loff_t final = end - offset + first; /* The final byte of the buffer */ > > + if (rw != WRITE) { > > + memset(addr, 0, size); > > + return; > > + } > It seems counterintuitive to zero out 
"on-disk" blocks (it seems you'd do > this for unwritten blocks) when reading from them. Presumably it could also > have undesired effects on endurance of persistent memory. Instead I'd expect > that you simply zero out user provided buffer the same way as you do it for > holes. I think we have to zero it here, because the second time we call get_block() for a given block, it won't be BH_New any more, so we won't know that it's supposed to be zeroed. > > +/* > > + * When ext4 encounters a hole, it likes to return without modifying the > > + * buffer_head which means that we can't trust b_size. To cope with this, > > + * we set b_state to 0 before calling get_block and, if any bit is set, we > > + * know we can trust b_size. Unfortunate, really, since ext4 does know > > + * precisely how long a hole is and would save us time calling get_block > > + * repeatedly. > Well, this is really a problem of get_blocks() returning the result in > struct buffer_head which is used for input as well. I don't think it is > actually ext4 specific. Of course it's ext4 specific! It's the ext4_get_block() implementation which is choosing not to return the length of the hole. XFS does return the length of the hole. I think something like this would fix it: +++ b/fs/ext4/inode.c @@ -727,14 +727,14 @@ static int _ext4_get_block(struct inode *inode, sector_t i } ret = ext4_map_blocks(handle, inode, &map, flags); + map_bh(bh, inode->i_sb, map.m_pblk); + bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags; + bh->b_size = inode->i_sb->s_blocksize * map.m_len; if (ret > 0) { ext4_io_end_t *io_end = ext4_inode_aio(inode); - map_bh(bh, inode->i_sb, map.m_pblk); - bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags; if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN) set_buffer_defer_completion(bh); - bh->b_size = inode->i_sb->s_blocksize * map.m_len; ret = 0; } if (started) (completely untested). 
> > + while (offset < end) { > > + void __user *buf = iov[seg].iov_base + copied; > > + > > + if (offset == max) { > > + sector_t block = offset >> inode->i_blkbits; > > + unsigned first = offset - (block << inode->i_blkbits); > > + long size; > > + > > + if (offset == bh_max) { > > + bh->b_size = PAGE_ALIGN(end - offset); > > + bh->b_state = 0; > > + retval = get_block(inode, block, bh, > > + rw == WRITE); > > + if (retval) > > + break; > > + if (!buffer_size_valid(bh)) > > + bh->b_size = 1 << inode->i_blkbits; > > + bh_max = offset - first + bh->b_size; > > + } else { > > + unsigned done = bh->b_size - (bh_max - > > + (offset - first)); > > + bh->b_blocknr += done >> inode->i_blkbits; > > + bh->b_size -= done; > It took me quite some time to figure out what this does and whether it is > correct :). Why isn't this at the place where we advance all other > iterators like offset, addr, etc.? It'll be kind of tricky to move it because 'len' is not necessarily a multiple of i_blkbits, so we can't necessarily maintain b_blocknr accurately. > > + if (rw == WRITE) { > > + if (!buffer_mapped(bh)) { > > + retval = -EIO; > > + break; > -EIO looks like a wrong error here. Or maybe it is the right one and it > only needs some explanation? The thing is that for direct IO some > filesystems choose not to fill holes for direct IO and fall back to > buffered IO instead (to avoid exposure of uninitialized blocks if the > system crashes after blocks have been added to a file but before they were > written out). For DAX you are pretty much free to define what you ask from > the get_blocks() (and this fallback behavior is somewhat disputed behavior > in direct IO case so you might want to differ here) but you should document > it somewhere. Hmm ... I thought that calling get_block() with the create argument would force the return of a bh with the Mapped bit set. Did I misunderstand that aspect of the undocumented get_block() API too? 
> > + if ((flags & DIO_LOCKING) && (rw == READ)) { > > + struct address_space *mapping = inode->i_mapping; > > + mutex_lock(&inode->i_mutex); > > + retval = filemap_write_and_wait_range(mapping, offset, end - 1); > > + if (retval) { > > + mutex_unlock(&inode->i_mutex); > > + goto out; > > + } > Is there a reason for this? I'd assume DAX has no pages in pagecache... There will be pages in the page cache for holes that we page faulted on. They must go! :-) > > @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, > > struct inode *inode = mapping->host; > > ssize_t ret; > > > > - ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, > > + if (IS_DAX(inode)) > > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > > + ext2_get_block, NULL, DIO_LOCKING); > > + else > > + ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, > > ext2_get_block); > I'd somewhat prefer to have a ext2_direct_IO() as is and have > ext2_dax_IO() call only dax_do_io() (and use that as .direct_io in > ext2_aops_xip). Then there's no need to check IS_DAX() and the code would > look more obvious to me. But I don't feel strongly about it. I can look at that ... but I was hoping to not have separate aops for XIP and non-XIP files. > > @@ -2681,6 +2686,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root); > > extern void save_mount_options(struct super_block *sb, char *options); > > extern void replace_mount_options(struct super_block *sb, char *options); > > > > +static inline bool io_is_direct(struct file *filp) > > +{ > > + return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp)); > > +} > > + > BTW: It seems fs/open.c: open_check_o_direct() can be simplified to not > check for get_xip_mem(), cannot it? 
That's in a later patch From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932726AbaDIJPB (ORCPT ); Wed, 9 Apr 2014 05:15:01 -0400 Received: from cantor2.suse.de ([195.135.220.15]:50072 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932465AbaDIJOz (ORCPT ); Wed, 9 Apr 2014 05:14:55 -0400 Date: Wed, 9 Apr 2014 11:14:50 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Message-ID: <20140409091450.GA32103@quack.suse.cz> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> <20140408202102.GB5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140408202102.GB5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 08-04-14 16:21:02, Matthew Wilcox wrote: > On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote: > > > +static void dax_new_buf(void *addr, unsigned size, unsigned first, > > > + loff_t offset, loff_t end, int rw) > > > +{ > > > + loff_t final = end - offset + first; /* The final byte of the buffer */ > > > + if (rw != WRITE) { > > > + memset(addr, 0, size); > > > + return; > > > + } > > It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do > > this for unwritten blocks) when reading from them. Presumably it could also > > have undesired effects on endurance of persistent memory. Instead I'd expect > > that you simply zero out user provided buffer the same way as you do it for > > holes. 
> > I think we have to zero it here, because the second time we call > get_block() for a given block, it won't be BH_New any more, so we won't > know that it's supposed to be zeroed. But how can you have BH_New buffer when you didn't ask get_blocks() to create any block? That would be a bug in the get_blocks() implementation... Or am I missing something? > > > +/* > > > + * When ext4 encounters a hole, it likes to return without modifying the > > > + * buffer_head which means that we can't trust b_size. To cope with this, > > > + * we set b_state to 0 before calling get_block and, if any bit is set, we > > > + * know we can trust b_size. Unfortunate, really, since ext4 does know > > > + * precisely how long a hole is and would save us time calling get_block > > > + * repeatedly. > > Well, this is really a problem of get_blocks() returning the result in > > struct buffer_head which is used for input as well. I don't think it is > > actually ext4 specific. > > Of course it's ext4 specific! It's the ext4_get_block() implementation > which is choosing not to return the length of the hole. XFS does return > the length of the hole. I think something like this would fix it: OK, but there are filesystems which do the same thing as ext4 (e.g. btrfs) and historically noone really cared. E.g. direct IO code advances only by a single block regardless of what filesystem returns when the buffer is unmapped. As you correctly mention, get_blocks() API isn't really documented so noone has really defined what should happen when you ask filesystem to map some blocks and there's a hole. I agree what XFS does looks sensible and ext4 can do the same. Hopefully this gets cleaned up when Dave finishes his new block mapping interface. 
> +++ b/fs/ext4/inode.c > @@ -727,14 +727,14 @@ static int _ext4_get_block(struct inode *inode, sector_t i > } > > ret = ext4_map_blocks(handle, inode, &map, flags); > + map_bh(bh, inode->i_sb, map.m_pblk); > + bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags; > + bh->b_size = inode->i_sb->s_blocksize * map.m_len; > if (ret > 0) { > ext4_io_end_t *io_end = ext4_inode_aio(inode); > > - map_bh(bh, inode->i_sb, map.m_pblk); > - bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags; > if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN) > set_buffer_defer_completion(bh); > - bh->b_size = inode->i_sb->s_blocksize * map.m_len; > ret = 0; > } > if (started) This wouldn't quite work because even ext4_map_blocks() doesn't bother to fill in 'map' when it finds a hole. But it won't be complicated to propagate the information. > > > + while (offset < end) { > > > + void __user *buf = iov[seg].iov_base + copied; > > > + > > > + if (offset == max) { > > > + sector_t block = offset >> inode->i_blkbits; > > > + unsigned first = offset - (block << inode->i_blkbits); > > > + long size; > > > + > > > + if (offset == bh_max) { > > > + bh->b_size = PAGE_ALIGN(end - offset); > > > + bh->b_state = 0; > > > + retval = get_block(inode, block, bh, > > > + rw == WRITE); > > > + if (retval) > > > + break; > > > + if (!buffer_size_valid(bh)) > > > + bh->b_size = 1 << inode->i_blkbits; > > > + bh_max = offset - first + bh->b_size; > > > + } else { > > > + unsigned done = bh->b_size - (bh_max - > > > + (offset - first)); > > > + bh->b_blocknr += done >> inode->i_blkbits; > > > + bh->b_size -= done; > > It took me quite some time to figure out what this does and whether it is > > correct :). Why isn't this at the place where we advance all other > > iterators like offset, addr, etc.? > > It'll be kind of tricky to move it because 'len' is not necessarily > a multiple of i_blkbits, so we can't necessarily maintain b_blocknr > accurately. 
Yeah, after I understood the code I also understood why you do it the way you did. But we could do something like: ... + if (!len) + break; + blocks = ((offset + len) >> inode->i_blkbits) - (offset >> inode->i_blkbits); bh->b_blocknr += blocks; bh->b_size -= blocks << inode->i_blkbits; + offset += len; + copied += len; + addr += len; ... BTW: it might be good to store inode->i_blkbits in a local variable. It makes some expressions shorter. BTW2: although direct IO uses 'offset' for position in file, the rest of VFS uses 'pos' for that and that seems to be less overloaded term so for me it would be easier if you used 'pos' instead of 'offset'. Just a suggestion. > > > + if (rw == WRITE) { > > > + if (!buffer_mapped(bh)) { > > > + retval = -EIO; > > > + break; > > -EIO looks like a wrong error here. Or maybe it is the right one and it > > only needs some explanation? The thing is that for direct IO some > > filesystems choose not to fill holes for direct IO and fall back to > > buffered IO instead (to avoid exposure of uninitialized blocks if the > > system crashes after blocks have been added to a file but before they were > > written out). For DAX you are pretty much free to define what you ask from > > the get_blocks() (and this fallback behavior is somewhat disputed behavior > > in direct IO case so you might want to differ here) but you should document > > it somewhere. > > Hmm ... I thought that calling get_block() with the create argument would > force the return of a bh with the Mapped bit set. Did I misunderstand that > aspect of the undocumented get_block() API too? As you mention the API is undocumented and not really designed. So filesystems do whatever causes the generic code to do what they want (it's a mess I know). In this case, I'm warning you there are filesystems which refuse to fill in holes from the get_blocks() function passed to blockdev_direct_IO() (even ext4 does this for inodes with old indirect-block based on disk format). 
You can just define DAX fails horribly in these case and I'm fine with that at least in this stage. If someone bothers later, fallback to buffered IO can be implemented. But we should document this somewhere. > > > + if ((flags & DIO_LOCKING) && (rw == READ)) { > > > + struct address_space *mapping = inode->i_mapping; > > > + mutex_lock(&inode->i_mutex); > > > + retval = filemap_write_and_wait_range(mapping, offset, end - 1); > > > + if (retval) { > > > + mutex_unlock(&inode->i_mutex); > > > + goto out; > > > + } > > Is there a reason for this? I'd assume DAX has no pages in pagecache... > > There will be pages in the page cache for holes that we page faulted on. > They must go! :-) Well, but this will only writeback dirty pages and if I read the code correctly those pages will never be dirty since dax_mkwrite() will replace them. Or am I missing something? > > > @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, > > > struct inode *inode = mapping->host; > > > ssize_t ret; > > > > > > - ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, > > > + if (IS_DAX(inode)) > > > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > > > + ext2_get_block, NULL, DIO_LOCKING); > > > + else > > > + ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs, > > > ext2_get_block); > > I'd somewhat prefer to have a ext2_direct_IO() as is and have > > ext2_dax_IO() call only dax_do_io() (and use that as .direct_io in > > ext2_aops_xip). Then there's no need to check IS_DAX() and the code would > > look more obvious to me. But I don't feel strongly about it. > > I can look at that ... but I was hoping to not have separate aops for > XIP and non-XIP files. OK, if you can do that, then I'm fine with the code as is. 
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758427AbaDIJ0l (ORCPT ); Wed, 9 Apr 2014 05:26:41 -0400 Received: from cantor2.suse.de ([195.135.220.15]:50320 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758121AbaDIJ0k (ORCPT ); Wed, 9 Apr 2014 05:26:40 -0400 Date: Wed, 9 Apr 2014 11:26:35 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page Message-ID: <20140409092635.GB32103@quack.suse.cz> References: <20140408221759.GD26019@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140408221759.GD26019@quack.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 09-04-14 00:17:59, Jan Kara wrote: > On Sun 23-03-14 15:08:34, Matthew Wilcox wrote: > > +/** > > + * dax_truncate_page - handle a partial page being truncated in a DAX file > > + * @inode: The file being truncated > > + * @from: The file offset that is being truncated to > > + * @get_block: The filesystem method used to translate file offsets to blocks > > + * > > + * Similar to block_truncate_page(), this function can be called by a > > + * filesystem when it is truncating an DAX file to handle the partial page. > > + * > > + * We work in terms of PAGE_CACHE_SIZE here for commonality with > > + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem > > + * took care of disposing of the unnecessary blocks. Even if the filesystem > > + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page > > + * since the file might be mmaped. 
> Well, DAX mmap support pretty much relies on PAGE_CACHE_SIZE == block > size (we cannot really map only a part of a physical page directly...). So > the comment seems somewhat misleading. I thought about this for a while and classical IO, truncation etc. could easily work for blocksize < pagesize. And for mmap() you could just use pagecache. Not sure if it's worth the complications though. Anyway we should decide whether we don't care about blocksize < PAGE_CACHE_SIZE at all, or whether we try to make things which can work reasonably easily functional. In that case dax_truncate_page() needs some tweaking because it currently assumes blocksize == PAGE_CACHE_SIZE. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932733AbaDIJqu (ORCPT ); Wed, 9 Apr 2014 05:46:50 -0400 Received: from cantor2.suse.de ([195.135.220.15]:50843 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758209AbaDIJqs (ORCPT ); Wed, 9 Apr 2014 05:46:48 -0400 Date: Wed, 9 Apr 2014 11:46:44 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Message-ID: <20140409094644.GD32103@quack.suse.cz> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:37, Matthew Wilcox wrote: > This is practically generic code; other filesystems will want to call > it from other places, but there's
nothing ext2-specific about it. > > Make it a little more generic by allowing it to take a count of the number > of bytes to zero rather than fixing it to a single page. Thanks to Dave > Hansen for suggesting that I need to call cond_resched() if zeroing more > than one page. Another day, some more review ;) Comments below. > > Signed-off-by: Matthew Wilcox > --- > fs/dax.c | 34 ++++++++++++++++++++++++++++++++++ > fs/ext2/inode.c | 8 +++++--- > fs/ext2/xip.c | 23 ----------------------- > fs/ext2/xip.h | 3 --- > include/linux/fs.h | 6 ++++++ > 5 files changed, 45 insertions(+), 29 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 7271be0..45a0a41 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -23,9 +23,43 @@ > #include > #include > #include > +#include > #include > #include > > +int dax_clear_blocks(struct inode *inode, sector_t block, long size) > +{ > + struct block_device *bdev = inode->i_sb->s_bdev; > + const struct block_device_operations *ops = bdev->bd_disk->fops; > + sector_t sector = block << (inode->i_blkbits - 9); > + unsigned long pfn; > + > + might_sleep(); > + do { > + void *addr; > + long count = ops->direct_access(bdev, sector, &addr, &pfn, > + size); So do you assume blocksize == PAGE_SIZE here? If not, addr could be in the middle of the page AFAICT. 
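To make the alignment question concrete, here is a rough userspace sketch of the same block-to-sector shift, assuming 512-byte sectors and a 4 KiB page. The names block_to_sector(), offset_in_page() and MODEL_PAGE_SIZE are made up for illustration; they are not kernel functions, and this is not the code from the patch.

```c
#include <assert.h>

#define SECTOR_SHIFT	9		/* 512-byte sectors, as in the quoted shift */
#define MODEL_PAGE_SIZE	4096UL		/* assumption: 4 KiB pages */

/* Same arithmetic as "block << (inode->i_blkbits - 9)" in the quoted code. */
static unsigned long long block_to_sector(unsigned long long block,
					  unsigned blkbits)
{
	assert(blkbits >= SECTOR_SHIFT);	/* blocks smaller than a sector make no sense here */
	return block << (blkbits - SECTOR_SHIFT);
}

/* Byte offset of the start of a sector within its (assumed 4 KiB) page. */
static unsigned long offset_in_page(unsigned long long sector)
{
	return (unsigned long)((sector << SECTOR_SHIFT) & (MODEL_PAGE_SIZE - 1));
}
```

Under these assumptions, with 4 KiB blocks (blkbits == 12) every block starts on a page boundary, while with 1 KiB blocks (blkbits == 10) block 1 maps to sector 2, which starts 1024 bytes into the page.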
> + if (count < 0) > + return count; > + while (count >= PAGE_SIZE) { > + clear_page(addr); > + addr += PAGE_SIZE; > + size -= PAGE_SIZE; > + count -= PAGE_SIZE; > + sector += PAGE_SIZE / 512; > + cond_resched(); > + } > + if (count > 0) { > + memset(addr, 0, count); > + sector += count / 512; > + size -= count; > + } > + } while (size); > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(dax_clear_blocks); > + > static long dax_get_addr(struct inode *inode, struct buffer_head *bh, > void **addr) > { > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index b156fe8..a9346a9 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode, > > if (IS_DAX(inode)) { > /* > - * we need to clear the block > + * block must be initialised before we put it in the tree > + * so that it's not found by another thread before it's > + * initialised > */ > - err = ext2_clear_xip_target (inode, > - le32_to_cpu(chain[depth-1].key)); > + err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key), > + count << inode->i_blkbits); Umm 'count' looks wrong here. You want to clear only one block, don't you? 
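As a small illustration of why the size argument matters, below is a userspace model, not the kernel code: the "media" is an ordinary byte array standing in for directly addressable storage, and model_clear_blocks() and count_zero_blocks() are hypothetical names. It only shows that a size of 1 << blkbits zeroes one block whereas count << blkbits zeroes count consecutive blocks; which of the two the caller intends is the question being asked above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero 'size' bytes starting at the first byte of block 'block'. */
static void model_clear_blocks(unsigned char *media, unsigned long block,
			       unsigned blkbits, size_t size)
{
	assert(media != NULL);
	memset(media + ((size_t)block << blkbits), 0, size);
}

/* Count how many of the first 'nblocks' blocks consist entirely of zero bytes. */
static unsigned count_zero_blocks(const unsigned char *media, unsigned blkbits,
				  unsigned nblocks)
{
	size_t bsize = (size_t)1 << blkbits;
	unsigned zero = 0;

	for (unsigned b = 0; b < nblocks; b++) {
		size_t i;

		for (i = 0; i < bsize; i++)
			if (media[b * bsize + i])
				break;
		if (i == bsize)		/* reached the end without a non-zero byte */
			zero++;
	}
	return zero;
}
```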
> if (err) { > mutex_unlock(&ei->truncate_mutex); > goto cleanup; -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932718AbaDIJw7 (ORCPT ); Wed, 9 Apr 2014 05:52:59 -0400 Received: from cantor2.suse.de ([195.135.220.15]:50965 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758185AbaDIJw6 (ORCPT ); Wed, 9 Apr 2014 05:52:58 -0400 Date: Wed, 9 Apr 2014 11:52:54 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb() Message-ID: <20140409095254.GE32103@quack.suse.cz> References: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:38, Matthew Wilcox wrote: > Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount() > doesn't make sense, since changing the XIP option on remount isn't > allowed. It also doesn't make sense to re-check whether blocksize is > supported since it can't change between mounts. > > Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the > equivalent check and delete the definition. Looks good. You can add: Reviewed-by: Jan Kara One nit below: ... > @@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) > sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | > ((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? 
MS_POSIXACL : 0); > > - ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset > - EXT2_MOUNT_XIP if not */ > - > - if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) { > - ext2_msg(sb, KERN_WARNING, > - "warning: unsupported blocksize for xip"); > - err = -EINVAL; > - goto restore_opts; > - } > - > es = sbi->s_es; > - if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) { > + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { > ext2_msg(sb, KERN_WARNING, "warning: refusing change of " > "xip flag with busy inodes while remounting"); > - sbi->s_mount_opt &= ~EXT2_MOUNT_XIP; > - sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP; > + sbi->s_mount_opt ^= EXT2_MOUNT_XIP; Although this is correct, it was easier to see that the previous code is correct so I'd prefer if you kept it that way. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932787AbaDIJzy (ORCPT ); Wed, 9 Apr 2014 05:55:54 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51034 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932247AbaDIJzw (ORCPT ); Wed, 9 Apr 2014 05:55:52 -0400 Date: Wed, 9 Apr 2014 11:55:49 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 13/22] ext2: Remove ext2_use_xip Message-ID: <20140409095549.GF32103@quack.suse.cz> References: <0c65dcd599646e3054d0c524a0c5b25b07885763.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <0c65dcd599646e3054d0c524a0c5b25b07885763.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:39, Matthew Wilcox wrote: > Replace 
ext2_use_xip() with test_opt(XIP) which expands to the same code Looks good. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > fs/ext2/ext2.h | 4 ++++ > fs/ext2/inode.c | 2 +- > fs/ext2/namei.c | 4 ++-- > 3 files changed, 7 insertions(+), 3 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index d9a17d0..5ecf570 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -380,7 +380,11 @@ struct ext2_inode { > #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ > #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ > #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ > +#ifdef CONFIG_FS_XIP > #define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ > +#else > +#define EXT2_MOUNT_XIP 0 > +#endif > #define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */ > #define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */ > #define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */ > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index a9346a9..2e587e2 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) > > if (S_ISREG(inode->i_mode)) { > inode->i_op = &ext2_file_inode_operations; > - if (ext2_use_xip(inode->i_sb)) { > + if (test_opt(inode->i_sb, XIP)) { > inode->i_mapping->a_ops = &ext2_aops_xip; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index c268d0a..846c356 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (ext2_use_xip(inode->i_sb)) { > + if (test_opt(inode->i_sb, XIP)) { > inode->i_mapping->a_ops = &ext2_aops_xip; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > @@ -126,7 
+126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (ext2_use_xip(inode->i_sb)) { > + if (test_opt(inode->i_sb, XIP)) { > inode->i_mapping->a_ops = &ext2_aops_xip; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932855AbaDIJ7X (ORCPT ); Wed, 9 Apr 2014 05:59:23 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51199 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758243AbaDIJ7V (ORCPT ); Wed, 9 Apr 2014 05:59:21 -0400 Date: Wed, 9 Apr 2014 11:59:18 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Message-ID: <20140409095918.GG32103@quack.suse.cz> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:41, Matthew Wilcox wrote: > The fewer Kconfig options we have the better. Use the generic > CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core. > > Signed-off-by: Matthew Wilcox Looks good. You can add: Reviewed-by: Jan Kara BTW: It's really only 2KB of code? 
Honza > --- > fs/Kconfig | 21 ++++++++++++++------- > fs/Makefile | 2 +- > fs/ext2/Kconfig | 11 ----------- > fs/ext2/ext2.h | 2 +- > fs/ext2/file.c | 4 ++-- > fs/ext2/super.c | 4 ++-- > include/linux/fs.h | 4 ++-- > 7 files changed, 22 insertions(+), 26 deletions(-) > > diff --git a/fs/Kconfig b/fs/Kconfig > index 7385e54..620ab73 100644 > --- a/fs/Kconfig > +++ b/fs/Kconfig > @@ -13,13 +13,6 @@ if BLOCK > source "fs/ext2/Kconfig" > source "fs/ext3/Kconfig" > source "fs/ext4/Kconfig" > - > -config FS_XIP > -# execute in place > - bool > - depends on EXT2_FS_XIP > - default y > - > source "fs/jbd/Kconfig" > source "fs/jbd2/Kconfig" > > @@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig" > source "fs/btrfs/Kconfig" > source "fs/nilfs2/Kconfig" > > +config FS_DAX > + bool "Direct Access support" > + depends on MMU > + help > + Direct Access (DAX) can be used on memory-backed block devices. > + If the block device supports DAX and the filesystem supports DAX, > + then you can avoid using the pagecache to buffer I/Os. Turning > + on this option will compile in support for DAX; you will need to > + mount the filesystem using the -o xip option. > + > + If you do not have a block device that is capable of using this, > + or if unsure, say N. Saying Y will increase the size of the kernel > + by about 2kB. 
> + > endif # BLOCK > > # Posix ACL utility routines > diff --git a/fs/Makefile b/fs/Makefile > index 2f194cd..b7e0a13 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o > obj-$(CONFIG_TIMERFD) += timerfd.o > obj-$(CONFIG_EVENTFD) += eventfd.o > obj-$(CONFIG_AIO) += aio.o > -obj-$(CONFIG_FS_XIP) += dax.o > +obj-$(CONFIG_FS_DAX) += dax.o > obj-$(CONFIG_FILE_LOCKING) += locks.o > obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o > obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o > diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig > index 14a6780..c634874e 100644 > --- a/fs/ext2/Kconfig > +++ b/fs/ext2/Kconfig > @@ -42,14 +42,3 @@ config EXT2_FS_SECURITY > > If you are not using a security module that requires using > extended attributes for file security labels, say N. > - > -config EXT2_FS_XIP > - bool "Ext2 execute in place support" > - depends on EXT2_FS && MMU > - help > - Execute in place can be used on memory-backed block devices. If you > - enable this option, you can select to mount block devices which are > - capable of this feature without using the page cache. > - > - If you do not use a block device that is capable of using this, > - or if unsure, say N. 
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index 5ecf570..b30c3bd 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -380,7 +380,7 @@ struct ext2_inode { > #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ > #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ > #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ > -#ifdef CONFIG_FS_XIP > +#ifdef CONFIG_FS_DAX > #define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ > #else > #define EXT2_MOUNT_XIP 0 > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index e3ce10d..ae7f000 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -25,7 +25,7 @@ > #include "xattr.h" > #include "acl.h" > > -#ifdef CONFIG_EXT2_FS_XIP > +#ifdef CONFIG_FS_DAX > static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > { > return dax_fault(vma, vmf, ext2_get_block); > @@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = { > .splice_write = generic_file_splice_write, > }; > > -#ifdef CONFIG_EXT2_FS_XIP > +#ifdef CONFIG_FS_DAX > const struct file_operations ext2_xip_file_operations = { > .llseek = generic_file_llseek, > .read = do_sync_read, > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index 752ccb4..fdcacf7 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) > seq_puts(seq, ",grpquota"); > #endif > > -#if defined(CONFIG_EXT2_FS_XIP) > +#ifdef CONFIG_FS_DAX > if (sbi->s_mount_opt & EXT2_MOUNT_XIP) > seq_puts(seq, ",xip"); > #endif > @@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb) > break; > #endif > case Opt_xip: > -#ifdef CONFIG_EXT2_FS_XIP > +#ifdef CONFIG_FS_DAX > set_opt (sbi->s_mount_opt, XIP); > #else > ext2_msg(sb, KERN_INFO, "xip option not supported"); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index aeab3fda..bff394d 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ 
-1681,7 +1681,7 @@ struct super_operations { > #define IS_IMA(inode) ((inode)->i_flags & S_IMA) > #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) > #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) > -#ifdef CONFIG_FS_XIP > +#ifdef CONFIG_FS_DAX > #define IS_DAX(inode) ((inode)->i_flags & S_DAX) > #else > #define IS_DAX(inode) 0 > @@ -2519,7 +2519,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset, > extern int generic_file_open(struct inode * inode, struct file * filp); > extern int nonseekable_open(struct inode * inode, struct file * filp); > > -#ifdef CONFIG_FS_XIP > +#ifdef CONFIG_FS_DAX > int dax_clear_blocks(struct inode *, sector_t block, long size); > int dax_truncate_page(struct inode *, loff_t from, get_block_t); > ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *, > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932877AbaDIJ7q (ORCPT ); Wed, 9 Apr 2014 05:59:46 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51215 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758185AbaDIJ7o (ORCPT ); Wed, 9 Apr 2014 05:59:44 -0400 Date: Wed, 9 Apr 2014 11:59:41 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 14/22] ext2: Remove xip.c and xip.h Message-ID: <20140409095941.GH32103@quack.suse.cz> References: <33ff0862f6d99b352429ef4494817544c3d5da68.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<33ff0862f6d99b352429ef4494817544c3d5da68.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:40, Matthew Wilcox wrote: > These files are now empty, so delete them Looks good, you can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Matthew Wilcox > --- > fs/ext2/Makefile | 1 - > fs/ext2/inode.c | 1 - > fs/ext2/namei.c | 1 - > fs/ext2/super.c | 1 - > fs/ext2/xip.c | 15 --------------- > fs/ext2/xip.h | 16 ---------------- > 6 files changed, 35 deletions(-) > delete mode 100644 fs/ext2/xip.c > delete mode 100644 fs/ext2/xip.h > > diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile > index f42af45..445b0e9 100644 > --- a/fs/ext2/Makefile > +++ b/fs/ext2/Makefile > @@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \ > ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o > ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o > ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o > -ext2-$(CONFIG_EXT2_FS_XIP) += xip.o > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 2e587e2..67124f0 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -34,7 +34,6 @@ > #include > #include "ext2.h" > #include "acl.h" > -#include "xip.h" > #include "xattr.h" > > static int __ext2_write_inode(struct inode *inode, int do_sync); > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index 846c356..7ca803f 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -35,7 +35,6 @@ > #include "ext2.h" > #include "xattr.h" > #include "acl.h" > -#include "xip.h" > > static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode) > { > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index 3a1db39..752ccb4 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -35,7 +35,6 @@ > #include "ext2.h" > #include "xattr.h" > #include "acl.h" > -#include "xip.h" > > static void ext2_sync_super(struct super_block 
*sb, > struct ext2_super_block *es, int wait); > diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c > deleted file mode 100644 > index 66ca113..0000000 > --- a/fs/ext2/xip.c > +++ /dev/null > @@ -1,15 +0,0 @@ > -/* > - * linux/fs/ext2/xip.c > - * > - * Copyright (C) 2005 IBM Corporation > - * Author: Carsten Otte (cotte@de.ibm.com) > - */ > - > -#include > -#include > -#include > -#include > -#include > -#include "ext2.h" > -#include "xip.h" > - > diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h > deleted file mode 100644 > index 87eeb04..0000000 > --- a/fs/ext2/xip.h > +++ /dev/null > @@ -1,16 +0,0 @@ > -/* > - * linux/fs/ext2/xip.h > - * > - * Copyright (C) 2005 IBM Corporation > - * Author: Carsten Otte (cotte@de.ibm.com) > - */ > - > -#ifdef CONFIG_EXT2_FS_XIP > -static inline int ext2_use_xip (struct super_block *sb) > -{ > - struct ext2_sb_info *sbi = EXT2_SB(sb); > - return (sbi->s_mount_opt & EXT2_MOUNT_XIP); > -} > -#else > -#define ext2_use_xip(sb) 0 > -#endif > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932926AbaDIKCh (ORCPT ); Wed, 9 Apr 2014 06:02:37 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51309 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932187AbaDIKCf (ORCPT ); Wed, 9 Apr 2014 06:02:35 -0400 Date: Wed, 9 Apr 2014 12:02:33 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 16/22] ext2: Remove ext2_aops_xip Message-ID: <20140409100233.GI32103@quack.suse.cz> References: <0b6512aa46a504459f41d3c609fc20c93d4a911a.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <0b6512aa46a504459f41d3c609fc20c93d4a911a.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:42, Matthew Wilcox wrote: > We shouldn't need a special address_space_operations any more > > Signed-off-by: Matthew Wilcox Looks good. You can add: Reviewed-by: Jan Kara Honza > --- > fs/ext2/ext2.h | 1 - > fs/ext2/inode.c | 7 +------ > fs/ext2/namei.c | 4 ++-- > 3 files changed, 3 insertions(+), 9 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index b30c3bd..b8b1c11 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations; > > /* inode.c */ > extern const struct address_space_operations ext2_aops; > -extern const struct address_space_operations ext2_aops_xip; > extern const struct address_space_operations ext2_nobh_aops; > > /* namei.c */ > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 67124f0..7ca76da 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = { > .error_remove_page = generic_error_remove_page, > }; > > -const struct address_space_operations ext2_aops_xip = { > - .bmap = ext2_bmap, > - .direct_IO = ext2_direct_IO, > -}; > - > const struct address_space_operations ext2_nobh_aops = { > .readpage = ext2_readpage, > .readpages = ext2_readpages, > @@ -1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) > if (S_ISREG(inode->i_mode)) { > inode->i_op = &ext2_file_inode_operations; > if (test_opt(inode->i_sb, XIP)) { > - inode->i_mapping->a_ops = &ext2_aops_xip; > + inode->i_mapping->a_ops = &ext2_aops; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = 
&ext2_nobh_aops; > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index 7ca803f..0db888c 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode > > inode->i_op = &ext2_file_inode_operations; > if (test_opt(inode->i_sb, XIP)) { > - inode->i_mapping->a_ops = &ext2_aops_xip; > + inode->i_mapping->a_ops = &ext2_aops; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > @@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) > > inode->i_op = &ext2_file_inode_operations; > if (test_opt(inode->i_sb, XIP)) { > - inode->i_mapping->a_ops = &ext2_aops_xip; > + inode->i_mapping->a_ops = &ext2_aops; > inode->i_fop = &ext2_xip_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932844AbaDIKEk (ORCPT ); Wed, 9 Apr 2014 06:04:40 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51366 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932247AbaDIKEj (ORCPT ); Wed, 9 Apr 2014 06:04:39 -0400 Date: Wed, 9 Apr 2014 12:04:35 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 17/22] Get rid of most mentions of XIP in ext2 Message-ID: <20140409100435.GJ32103@quack.suse.cz> References: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:43, Matthew Wilcox wrote: > The only remaining usage is userspace's 'xip' option. Looks good. You can add: Reviewed-by: Jan Kara Honza > --- > fs/ext2/ext2.h | 6 +++--- > fs/ext2/file.c | 2 +- > fs/ext2/inode.c | 6 +++--- > fs/ext2/namei.c | 8 ++++---- > fs/ext2/super.c | 16 ++++++++-------- > 5 files changed, 19 insertions(+), 19 deletions(-) > > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index b8b1c11..0e1fe9d 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -381,9 +381,9 @@ struct ext2_inode { > #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ > #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ > #ifdef CONFIG_FS_DAX > -#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ > +#define EXT2_MOUNT_DAX 0x010000 /* Direct Access */ > #else > -#define EXT2_MOUNT_XIP 0 > +#define EXT2_MOUNT_DAX 0 > #endif > #define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */ > #define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */ > @@ -789,7 +789,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, > int datasync); > extern const struct inode_operations ext2_file_inode_operations; > extern const struct file_operations ext2_file_operations; > -extern const struct file_operations ext2_xip_file_operations; > +extern const struct file_operations ext2_dax_file_operations; > > /* inode.c */ > extern const struct address_space_operations ext2_aops; > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index ae7f000..f9bcb9b 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = { > }; > > #ifdef CONFIG_FS_DAX > -const struct 
file_operations ext2_xip_file_operations = { > +const struct file_operations ext2_dax_file_operations = { > .llseek = generic_file_llseek, > .read = do_sync_read, > .write = do_sync_write, > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index 7ca76da..3776063 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -1285,7 +1285,7 @@ void ext2_set_inode_flags(struct inode *inode) > inode->i_flags |= S_NOATIME; > if (flags & EXT2_DIRSYNC_FL) > inode->i_flags |= S_DIRSYNC; > - if (test_opt(inode->i_sb, XIP)) > + if (test_opt(inode->i_sb, DAX)) > inode->i_flags |= S_DAX; > } > > @@ -1387,9 +1387,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) > > if (S_ISREG(inode->i_mode)) { > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, XIP)) { > + if (test_opt(inode->i_sb, DAX)) { > inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_xip_file_operations; > + inode->i_fop = &ext2_dax_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index 0db888c..148f6e3 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, XIP)) { > + if (test_opt(inode->i_sb, DAX)) { > inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_xip_file_operations; > + inode->i_fop = &ext2_dax_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > @@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) > return PTR_ERR(inode); > > inode->i_op = &ext2_file_inode_operations; > - if (test_opt(inode->i_sb, XIP)) { > + if (test_opt(inode->i_sb, DAX)) { > 
inode->i_mapping->a_ops = &ext2_aops; > - inode->i_fop = &ext2_xip_file_operations; > + inode->i_fop = &ext2_dax_file_operations; > } else if (test_opt(inode->i_sb, NOBH)) { > inode->i_mapping->a_ops = &ext2_nobh_aops; > inode->i_fop = &ext2_file_operations; > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index fdcacf7..8062373 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -288,7 +288,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) > #endif > > #ifdef CONFIG_FS_DAX > - if (sbi->s_mount_opt & EXT2_MOUNT_XIP) > + if (sbi->s_mount_opt & EXT2_MOUNT_DAX) > seq_puts(seq, ",xip"); > #endif > > @@ -393,7 +393,7 @@ enum { > Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, > Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug, > Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr, > - Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota, > + Opt_acl, Opt_noacl, Opt_dax, Opt_ignore, Opt_err, Opt_quota, > Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation > }; > > @@ -421,7 +421,7 @@ static const match_table_t tokens = { > {Opt_nouser_xattr, "nouser_xattr"}, > {Opt_acl, "acl"}, > {Opt_noacl, "noacl"}, > - {Opt_xip, "xip"}, > + {Opt_dax, "xip"}, > {Opt_grpquota, "grpquota"}, > {Opt_ignore, "noquota"}, > {Opt_quota, "quota"}, > @@ -548,9 +548,9 @@ static int parse_options(char *options, struct super_block *sb) > "(no)acl options not supported"); > break; > #endif > - case Opt_xip: > + case Opt_dax: > #ifdef CONFIG_FS_DAX > - set_opt (sbi->s_mount_opt, XIP); > + set_opt (sbi->s_mount_opt, DAX); > #else > ext2_msg(sb, KERN_INFO, "xip option not supported"); > #endif > @@ -896,7 +896,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) > > blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); > > - if (sbi->s_mount_opt & EXT2_MOUNT_XIP) { > + if (sbi->s_mount_opt & EXT2_MOUNT_DAX) { > if (blocksize != PAGE_SIZE) { > ext2_msg(sb, KERN_ERR, > "error: 
unsupported blocksize for xip"); > @@ -1275,10 +1275,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) > ((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0); > > es = sbi->s_es; > - if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { > + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) { > ext2_msg(sb, KERN_WARNING, "warning: refusing change of " > "xip flag with busy inodes while remounting"); > - sbi->s_mount_opt ^= EXT2_MOUNT_XIP; > + sbi->s_mount_opt ^= EXT2_MOUNT_DAX; > } > if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) { > spin_unlock(&sbi->s_lock); > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758404AbaDIKHO (ORCPT ); Wed, 9 Apr 2014 06:07:14 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51404 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757481AbaDIKHL (ORCPT ); Wed, 9 Apr 2014 06:07:11 -0400 Date: Wed, 9 Apr 2014 12:07:09 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Matthew Wilcox Subject: Re: [PATCH v7 22/22] brd: Rename XIP to DAX Message-ID: <20140409100709.GK32103@quack.suse.cz> References: <7fd74703525f4077ed7c2b273ce6d082b03f0b61.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <7fd74703525f4077ed7c2b273ce6d082b03f0b61.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:48, Matthew Wilcox 
wrote: > From: Matthew Wilcox > > Since this is relating to FS_XIP, not KERNEL_XIP, it should be called > DAX instead of XIP. > > Signed-off-by: Matthew Wilcox Looks good. You can add: Reviewed-by: Jan Kara Honza > --- > drivers/block/Kconfig | 13 +++++++------ > drivers/block/brd.c | 14 +++++++------- > fs/Kconfig | 4 ++-- > 3 files changed, 16 insertions(+), 15 deletions(-) > > diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig > index 014a1cf..1b8094d 100644 > --- a/drivers/block/Kconfig > +++ b/drivers/block/Kconfig > @@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE > The default value is 4096 kilobytes. Only change this if you know > what you are doing. > > -config BLK_DEV_XIP > - bool "Support XIP filesystems on RAM block device" > - depends on BLK_DEV_RAM > +config BLK_DEV_RAM_DAX > + bool "Support Direct Access (DAX) to RAM block devices" > + depends on BLK_DEV_RAM && FS_DAX > default n > help > - Support XIP filesystems (such as ext2 with XIP support on) on > - top of block ram device. This will slightly enlarge the kernel, and > - will prevent RAM block device backing store memory from being > + Support filesystems using DAX to access RAM block devices. This > + avoids double-buffering data in the page cache before copying it > + to the block device. Answering Y will slightly enlarge the kernel, > + and will prevent RAM block device backing store memory from being > allocated from highmem (only a problem for highmem systems). > > config CDROM_PKTCDVD > diff --git a/drivers/block/brd.c b/drivers/block/brd.c > index 00da60d..619e0e0 100644 > --- a/drivers/block/brd.c > +++ b/drivers/block/brd.c > @@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) > * Must use NOIO because we don't want to recurse back into the > * block or filesystem layers from page reclaim. > * > - * Cannot support XIP and highmem, because our ->direct_access > - * routine for XIP must return memory that is always addressable. 
> - * If XIP was reworked to use pfns and kmap throughout, this > + * Cannot support DAX and highmem, because our ->direct_access > + * routine for DAX must return memory that is always addressable. > + * If DAX was reworked to use pfns and kmap throughout, this > * restriction might be able to be lifted. > */ > gfp_flags = GFP_NOIO | __GFP_ZERO; > -#ifndef CONFIG_BLK_DEV_XIP > +#ifndef CONFIG_BLK_DEV_RAM_DAX > gfp_flags |= __GFP_HIGHMEM; > #endif > page = alloc_page(gfp_flags); > @@ -360,7 +360,7 @@ out: > bio_endio(bio, err); > } > > -#ifdef CONFIG_BLK_DEV_XIP > +#ifdef CONFIG_BLK_DEV_RAM_DAX > static long brd_direct_access(struct block_device *bdev, sector_t sector, > void **kaddr, unsigned long *pfn, long size) > { > @@ -383,6 +383,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector, > * file is mapped to the next page of physical RAM */ > return PAGE_SIZE; > } > +#else > +#define brd_direct_access NULL > #endif > > static int brd_ioctl(struct block_device *bdev, fmode_t mode, > @@ -422,9 +424,7 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode, > static const struct block_device_operations brd_fops = { > .owner = THIS_MODULE, > .ioctl = brd_ioctl, > -#ifdef CONFIG_BLK_DEV_XIP > .direct_access = brd_direct_access, > -#endif > }; > > /* > diff --git a/fs/Kconfig b/fs/Kconfig > index 620ab73..376bd0a 100644 > --- a/fs/Kconfig > +++ b/fs/Kconfig > @@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig" > source "fs/nilfs2/Kconfig" > > config FS_DAX > - bool "Direct Access support" > + bool "Direct Access (DAX) support" > depends on MMU > help > Direct Access (DAX) can be used on memory-backed block devices. > @@ -45,7 +45,7 @@ config FS_DAX > > If you do not have a block device that is capable of using this, > or if unsure, say N. Saying Y will increase the size of the kernel > - by about 2kB. > + by about 5kB. 
> > endif # BLOCK > > -- > 1.9.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932833AbaDIK2G (ORCPT ); Wed, 9 Apr 2014 06:28:06 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51902 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932597AbaDIK2D (ORCPT ); Wed, 9 Apr 2014 06:28:03 -0400 Date: Wed, 9 Apr 2014 12:27:58 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140409102758.GM32103@quack.suse.cz> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org One more comment: On Sun 23-03-14 15:08:33, Matthew Wilcox wrote: > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ > + struct file *file = vma->vm_file; > + struct inode *inode = file_inode(file); > + struct address_space *mapping = file->f_mapping; > + struct page *page; > + struct buffer_head bh; > + unsigned long vaddr = (unsigned long)vmf->virtual_address; > + sector_t block; > + pgoff_t size; > + unsigned long pfn; > + int error; > + int major = 0; > + > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (vmf->pgoff >= size) > + return VM_FAULT_SIGBUS; > + > + memset(&bh, 0, sizeof(bh)); > + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits); > + bh.b_size = PAGE_SIZE; > + > + repeat: 
> + page = find_get_page(mapping, vmf->pgoff); > + if (page) { > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { > + page_cache_release(page); > + return VM_FAULT_RETRY; > + } > + if (unlikely(page->mapping != mapping)) { > + unlock_page(page); > + page_cache_release(page); > + goto repeat; > + } > + } > + > + error = get_block(inode, block, &bh, 0); > + if (error || bh.b_size < PAGE_SIZE) > + goto sigbus; > + > + if (!buffer_written(&bh) && !vmf->cow_page) { > + if (vmf->flags & FAULT_FLAG_WRITE) { > + error = get_block(inode, block, &bh, 1); > + count_vm_event(PGMAJFAULT); > + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); > + major = VM_FAULT_MAJOR; > + if (error || bh.b_size < PAGE_SIZE) > + goto sigbus; > + } else { > + return dax_load_hole(mapping, page, vmf); > + } > + } > + > + /* Recheck i_size under i_mmap_mutex */ > + mutex_lock(&mapping->i_mmap_mutex); > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; > + if (unlikely(vmf->pgoff >= size)) { > + mutex_unlock(&mapping->i_mmap_mutex); > + goto sigbus; You need to release the block you've got from the filesystem in case of error here and below.
Honza > + } > + if (vmf->cow_page) { > + if (buffer_written(&bh)) > + copy_user_bh(vmf->cow_page, inode, &bh, vaddr); > + else > + clear_user_highpage(vmf->cow_page, vaddr); > + if (page) { > + unlock_page(page); > + page_cache_release(page); > + } > + /* do_cow_fault() will release the i_mmap_mutex */ > + return VM_FAULT_COWED; > + } > + > + if (buffer_unwritten(&bh) || buffer_new(&bh)) > + dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); > + > + error = dax_get_pfn(inode, &bh, &pfn); > + if (error > 0) > + error = vm_insert_mixed(vma, vaddr, pfn); > + mutex_unlock(&mapping->i_mmap_mutex); > + > + if (page) { > + delete_from_page_cache(page); > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > + PAGE_CACHE_SIZE, 0); > + unlock_page(page); > + page_cache_release(page); > + } > + > + if (error == -ENOMEM) > + return VM_FAULT_OOM; > + /* -EBUSY is fine, somebody else faulted on the same PTE */ > + if (error != -EBUSY) > + BUG_ON(error); > + return VM_FAULT_NOPAGE | major; > + > + sigbus: > + if (page) { > + unlock_page(page); > + page_cache_release(page); > + } > + return VM_FAULT_SIGBUS; > +} -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933245AbaDIMEp (ORCPT ); Wed, 9 Apr 2014 08:04:45 -0400 Received: from cantor2.suse.de ([195.135.220.15]:54351 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932592AbaDIMEn (ORCPT ); Wed, 9 Apr 2014 08:04:43 -0400 Date: Wed, 9 Apr 2014 14:04:37 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Message-ID: <20140409120437.GA7715@quack.suse.cz> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: 
inline In-Reply-To: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I've noticed one more thing here: On Sun 23-03-14 15:08:32, Matthew Wilcox wrote: .... > +ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode, > + const struct iovec *iov, loff_t offset, unsigned nr_segs, > + get_block_t get_block, dio_iodone_t end_io, int flags) > +{ ... > + retval = dax_io(rw, inode, iov, offset, end, get_block, &bh); > + > + if ((flags & DIO_LOCKING) && (rw == READ)) > + mutex_unlock(&inode->i_mutex); > + > + inode_dio_done(inode); > + > + if ((retval > 0) && end_io) > + end_io(iocb, offset, retval, bh.b_private); In direct IO code, we first call end_io() callback and do inode_dio_done() only after that. Since filesystems use i_dio_count for protecting against different races, calling end_io() after inode_dio_done() can open all sorts of subtle races. 
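The ordering described above can be sketched in userspace C. This is only a model under stated assumptions, not kernel code: model_inode, record_end_io and finish_dio are hypothetical names, the dio_count field stands in for i_dio_count, and the callback stands in for end_io(). It shows the callback running while the I/O is still accounted, with the counter dropped afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode's in-flight direct I/O counter. */
struct model_inode {
	int dio_count;        /* models i_dio_count */
	int seen_by_end_io;   /* counter value observed inside the callback */
};

typedef void (*model_end_io_t)(struct model_inode *inode, long retval);

/* Example completion callback: records whether the I/O was still accounted. */
static void record_end_io(struct model_inode *inode, long retval)
{
	(void)retval;
	inode->seen_by_end_io = inode->dio_count;
}

/*
 * Tail of a dax_do_io()-like function in the order the direct I/O code
 * uses: run the completion callback first, drop the counter afterwards.
 */
static long finish_dio(struct model_inode *inode, long retval,
		       model_end_io_t end_io)
{
	if (retval > 0 && end_io)
		end_io(inode, retval);  /* counter is still elevated here */
	inode->dio_count--;             /* models inode_dio_done() */
	return retval;
}
```

With this order, anything that waits for the counter to reach zero cannot proceed until the completion callback has finished, which appears to be the property the filesystems rely on.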
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933263AbaDIMRX (ORCPT ); Wed, 9 Apr 2014 08:17:23 -0400 Received: from cantor2.suse.de ([195.135.220.15]:54519 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932592AbaDIMRV (ORCPT ); Wed, 9 Apr 2014 08:17:21 -0400 Date: Wed, 9 Apr 2014 14:17:17 +0200 From: Jan Kara To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ross Zwisler , willy@linux.intel.com Subject: Re: [PATCH v7 20/22] ext4: Add DAX functionality Message-ID: <20140409121717.GN32103@quack.suse.cz> References: <490bf3041f0e0633964ca84bf4fb0bb3dd999694.1395591795.git.matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <490bf3041f0e0633964ca84bf4fb0bb3dd999694.1395591795.git.matthew.r.wilcox@intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 23-03-14 15:08:46, Matthew Wilcox wrote: > From: Ross Zwisler > > This is a port of the DAX functionality found in the current version of > ext2. > > Signed-off-by: Ross Zwisler > Reviewed-by: Andreas Dilger > [heavily tweaked] > Signed-off-by: Matthew Wilcox > --- I have some comments below. 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 1a50739..42a8ccd 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, > } > } > > - if (unlikely(iocb->ki_filp->f_flags & O_DIRECT)) > + if (io_is_direct(iocb->ki_filp)) > ret = ext4_file_dio_write(iocb, iov, nr_segs, pos); > else > ret = generic_file_aio_write(iocb, iov, nr_segs, pos); > @@ -198,6 +198,27 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov, > return ret; > } > > +#ifdef CONFIG_FS_DAX > +static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return dax_fault(vma, vmf, ext4_get_block); > + /* Is this the right get_block? */ Yes, it is the right one. > +} > + > +static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return dax_mkwrite(vma, vmf, ext4_get_block); > +} Umm, I'm afraid it won't be this easy here. So you rely on ext4_get_block() to start a transaction for you and do the block allocation. However, if the system crashes after ext4_get_block() has allocated the block and finished the transaction but before dax_mkwrite() had a chance to zero out the page, the filesystem will be referencing a block with uninitialized data when the system boots again (this is a security issue for multiuser systems). What you need to do is to start a transaction here in ext4_dax_mkwrite(), call dax_mkwrite() (ext4_get_block() will notice the transaction is already started and won't start it again so you don't have to care about that), and stop the transaction after dax_mkwrite() returns. Except it's not so easy because sb_start_pagefault() locking ranks above transaction start so ext4 will really need to call into something like do_dax_fault() - I'd suggest we create dax_mkwrite() and __dax_mkwrite() similarly to how block_page_mkwrite() and __block_page_mkwrite() from fs/buffer.c do.
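The suggested dax_mkwrite() / __dax_mkwrite() split can be illustrated with a small userspace model. Everything here is an assumption made for illustration: the model_* functions are hypothetical stubs that only append to a trace string, they take no real locks and start no real transaction. The point is the nesting: freeze protection outermost, the transaction inside it, and block allocation plus zeroing inside the transaction.

```c
#include <assert.h>
#include <string.h>

/* Event log used only to make the call ordering observable. */
static char trace[256];

static void log_event(const char *name)
{
	strcat(trace, name);
	strcat(trace, ";");
}

/* Stubs standing in for the real kernel primitives (no real locking). */
static void model_sb_start_pagefault(void) { log_event("freeze_protect"); }
static void model_sb_end_pagefault(void)   { log_event("freeze_unprotect"); }
static void model_journal_start(void)      { log_event("journal_start"); }
static void model_journal_stop(void)       { log_event("journal_stop"); }

/* Inner helper: allocate and zero the block; takes no freeze protection. */
static int model___dax_mkwrite(void)
{
	log_event("allocate_and_zero");
	return 0;
}

/* Generic wrapper for filesystems that need no transaction of their own. */
static int model_dax_mkwrite(void)
{
	int ret;

	model_sb_start_pagefault();
	ret = model___dax_mkwrite();
	model_sb_end_pagefault();
	return ret;
}

/* ext4-style caller: freeze protection outermost, transaction inside it. */
static int model_ext4_dax_mkwrite(void)
{
	int ret;

	model_sb_start_pagefault();
	model_journal_start();
	ret = model___dax_mkwrite();
	model_journal_stop();
	model_sb_end_pagefault();
	return ret;
}
```

A filesystem without a journal would call the plain wrapper, while an ext4-like caller would use the inner helper directly so that it controls where the transaction starts and stops.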
> + > +static const struct vm_operations_struct ext4_dax_vm_ops = { > + .fault = ext4_dax_fault, > + .page_mkwrite = ext4_dax_mkwrite, > + .remap_pages = generic_file_remap_pages, > +}; > +#else > +#define ext4_dax_vm_ops ext4_file_vm_ops > +#endif > + > static const struct vm_operations_struct ext4_file_vm_ops = { > .fault = filemap_fault, > .page_mkwrite = ext4_page_mkwrite, > @@ -206,12 +227,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = { > > static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) > { > - struct address_space *mapping = file->f_mapping; > - > - if (!mapping->a_ops->readpage) > - return -ENOEXEC; > file_accessed(file); > - vma->vm_ops = &ext4_file_vm_ops; > + if (IS_DAX(file_inode(file))) { > + vma->vm_ops = &ext4_dax_vm_ops; > + vma->vm_flags |= VM_MIXEDMAP; > + } else { > + vma->vm_ops = &ext4_file_vm_ops; > + } > return 0; > } > > @@ -609,6 +631,25 @@ const struct file_operations ext4_file_operations = { > .fallocate = ext4_fallocate, > }; > > +#ifdef CONFIG_FS_DAX > +const struct file_operations ext4_dax_file_operations = { > + .llseek = ext4_llseek, > + .read = do_sync_read, > + .write = do_sync_write, > + .aio_read = generic_file_aio_read, > + .aio_write = ext4_file_write, > + .unlocked_ioctl = ext4_ioctl, > +#ifdef CONFIG_COMPAT > + .compat_ioctl = ext4_compat_ioctl, > +#endif > + .mmap = ext4_file_mmap, > + .open = ext4_file_open, > + .release = ext4_release_file, > + .fsync = ext4_sync_file, > + .fallocate = ext4_fallocate, > +}; > +#endif > + > const struct inode_operations ext4_file_inode_operations = { > .setattr = ext4_setattr, > .getattr = ext4_getattr, > diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c > index 594009f..5fdb414 100644 > --- a/fs/ext4/indirect.c > +++ b/fs/ext4/indirect.c > @@ -686,15 +686,22 @@ retry: > inode_dio_done(inode); > goto locked; > } > - ret = __blockdev_direct_IO(rw, iocb, inode, > - inode->i_sb->s_bdev, iov, > - offset, nr_segs, > - ext4_get_block, NULL, NULL, 
0); > + if (IS_DAX(inode)) > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > + ext4_get_block, NULL, 0); > + else > + ret = __blockdev_direct_IO(rw, iocb, inode, > + inode->i_sb->s_bdev, iov, offset, > + nr_segs, ext4_get_block, NULL, NULL, 0); > inode_dio_done(inode); > } else { > locked: > - ret = blockdev_direct_IO(rw, iocb, inode, iov, > - offset, nr_segs, ext4_get_block); > + if (IS_DAX(inode)) > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > + ext4_get_block, NULL, DIO_LOCKING); > + else > + ret = blockdev_direct_IO(rw, iocb, inode, iov, > + offset, nr_segs, ext4_get_block); > > if (unlikely((rw & WRITE) && ret < 0)) { > loff_t isize = i_size_read(inode); > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index ce7341c..9462730 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3140,13 +3140,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb, > get_block_func = ext4_get_block_write; > dio_flags = DIO_LOCKING; > } > - ret = __blockdev_direct_IO(rw, iocb, inode, > - inode->i_sb->s_bdev, iov, > - offset, nr_segs, > - get_block_func, > - ext4_end_io_dio, > - NULL, > - dio_flags); > + if (IS_DAX(inode)) > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs, > + get_block_func, ext4_end_io_dio, dio_flags); > + else > + ret = __blockdev_direct_IO(rw, iocb, inode, > + inode->i_sb->s_bdev, iov, offset, > + nr_segs, get_block_func, > + ext4_end_io_dio, NULL, dio_flags); > Since you don't do real AIO for DAX, you could handle async iocbs for DAX inodes the same way as normal sync iocbs (i.e., you don't need to allocate ioend and do completion from a workqueue but handle everything necessary in ext4_ext_direct_IO()). That will be noticeably faster and with smaller CPU load as well. I'm not saying you have to do that now (although it shouldn't be complicated) but at least note that in a comment please. 
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933498AbaDIPYd (ORCPT ); Wed, 9 Apr 2014 11:24:33 -0400 Received: from mga01.intel.com ([192.55.52.88]:30408 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932826AbaDIPYb (ORCPT ); Wed, 9 Apr 2014 11:24:31 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,826,1389772800"; d="scan'208";a="517573126" Date: Wed, 9 Apr 2014 11:19:08 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Message-ID: <20140409151908.GD5727@linux.intel.com> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> <20140408202102.GB5727@linux.intel.com> <20140409091450.GA32103@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409091450.GA32103@quack.suse.cz> User-Agent: Mutt/1.5.22 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 11:14:50AM +0200, Jan Kara wrote: > On Tue 08-04-14 16:21:02, Matthew Wilcox wrote: > > On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote: > > > > +static void dax_new_buf(void *addr, unsigned size, unsigned first, > > > > + loff_t offset, loff_t end, int rw) > > > > +{ > > > > + loff_t final = end - offset + first; /* The final byte of the buffer */ > > > > + if (rw != WRITE) { > > > > + memset(addr, 0, size); > > > > + return; > > > > + } > > > It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do > > > this for unwritten blocks) when reading from them. Presumably it could also > > > have undesired effects on endurance of persistent memory. 
Instead I'd expect > > > that you simply zero out user provided buffer the same way as you do it for > > > holes. > > > > I think we have to zero it here, because the second time we call > > get_block() for a given block, it won't be BH_New any more, so we won't > > know that it's supposed to be zeroed. > But how can you have BH_New buffer when you didn't ask get_blocks() to > create any block? That would be a bug in the get_blocks() implementation... > Or am I missing something? Oh ... right. So just to be clear, we're looking at the case where we're doing a read of a filesystem block which is BH_Unwritten, but isn't a hole ... so it's been allocated on storage and not yet written. That's already treated as a hole: if (rw == WRITE) { ... } else { hole = !buffer_written(bh); } and dax_new_buf is only called in the !hole case. > OK, but there are filesystems which do the same thing as ext4 (e.g. > btrfs) and historically no one really cared. E.g. direct IO code advances > only by a single block regardless of what filesystem returns when the > buffer is unmapped. As you correctly mention, get_blocks() API isn't really > documented so no one has really defined what should happen when you ask > filesystem to map some blocks and there's a hole. I agree what XFS does > looks sensible and ext4 can do the same. Hopefully this gets cleaned up > when Dave finishes his new block mapping interface. I hope so too! The get_block() API has been the bane of my existence since Christmas :-) > This wouldn't quite work because even ext4_map_blocks() doesn't bother to > fill in 'map' when it finds a hole. But it won't be complicated to > propagate the information. Good point. > > It'll be kind of tricky to move it because 'len' is not necessarily > > a multiple of i_blkbits, so we can't necessarily maintain b_blocknr > > accurately. > Yeah, after I understood the code I also understood why you do it the way > you did. But we could do something like: > ...
> + if (!len) > + break; > + > blocks = ((offset + len) >> inode->i_blkbits) - > (offset >> inode->i_blkbits); > bh->b_blocknr += blocks; > bh->b_size -= blocks << inode->i_blkbits; > + offset += len; > + copied += len; > + addr += len; > ... We could ... I'm not sure it's simpler though. > BTW: it might be good to store inode->i_blkbits in a local variable. It > makes some expressions shorter. Yes, good idea. Done. > BTW2: although direct IO uses 'offset' for position in file, the rest of > VFS uses 'pos' for that and that seems to be less overloaded term so for me > it would be easier if you used 'pos' instead of 'offset'. Just a > suggestion. Sure. Done. > > > > + if (rw == WRITE) { > > > > + if (!buffer_mapped(bh)) { > > > > + retval = -EIO; > > > > + break; > > > -EIO looks like a wrong error here. Or maybe it is the right one and it > > > only needs some explanation? The thing is that for direct IO some > > > filesystems choose not to fill holes for direct IO and fall back to > > > buffered IO instead (to avoid exposure of uninitialized blocks if the > > > system crashes after blocks have been added to a file but before they were > > > written out). For DAX you are pretty much free to define what you ask from > > > the get_blocks() (and this fallback behavior is somewhat disputed behavior > > > in direct IO case so you might want to differ here) but you should document > > > it somewhere. > > > > Hmm ... I thought that calling get_block() with the create argument would > > force the return of a bh with the Mapped bit set. Did I misunderstand that > > aspect of the undocumented get_block() API too? > As you mention the API is undocumented and not really designed. So > filesystems do whatever causes the generic code to do what they want (it's > a mess I know). 
In this case, I'm warning you there are filesystems which > refuse to fill in holes from the get_blocks() function passed to > blockdev_direct_IO() (even ext4 does this for inodes with old > indirect-block based on disk format). You can just define DAX fails > horribly in these case and I'm fine with that at least in this stage. If > someone bothers later, fallback to buffered IO can be implemented. But we > should document this somewhere. Urgh. Yeah, we should probably fall back to buffered I/O for that case. I'll stick a comment in dax.c for now, and we can fix it later. > > > > + if ((flags & DIO_LOCKING) && (rw == READ)) { > > > > + struct address_space *mapping = inode->i_mapping; > > > > + mutex_lock(&inode->i_mutex); > > > > + retval = filemap_write_and_wait_range(mapping, offset, end - 1); > > > > + if (retval) { > > > > + mutex_unlock(&inode->i_mutex); > > > > + goto out; > > > > + } > > > Is there a reason for this? I'd assume DAX has no pages in pagecache... > > > > There will be pages in the page cache for holes that we page faulted on. > > They must go! :-) > Well, but this will only writeback dirty pages and if I read the code > correctly those pages will never be dirty since dax_mkwrite() will replace > them. Or am I missing something? In addition to writing back dirty pages, filemap_write_and_wait_range() will evict clean pages. Unintuitive, I know, but it matches what the direct I/O path does. Plus, if we fall back to buffered I/O for holes (see above), then this will do the right thing at that time. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964985AbaDIUwY (ORCPT ); Wed, 9 Apr 2014 16:52:24 -0400 Received: from mga11.intel.com ([192.55.52.93]:9252 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933590AbaDIUwX (ORCPT ); Wed, 9 Apr 2014 16:52:23 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,828,1389772800"; d="scan'208";a="510390472" Date: Wed, 9 Apr 2014 16:51:11 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140409205111.GG5727@linux.intel.com> References: <20140409102758.GM32103@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409102758.GM32103@quack.suse.cz> User-Agent: Mutt/1.5.22 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 12:27:58PM +0200, Jan Kara wrote: > > + if (unlikely(vmf->pgoff >= size)) { > > + mutex_unlock(&mapping->i_mmap_mutex); > > + goto sigbus; > You need to release the block you've got from the filesystem in case of > error here and below. What's the API to do that? Call inode->i_op->setattr()?
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934351AbaDIUzd (ORCPT ); Wed, 9 Apr 2014 16:55:33 -0400 Received: from cantor2.suse.de ([195.135.220.15]:36576 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965010AbaDIUzb (ORCPT ); Wed, 9 Apr 2014 16:55:31 -0400 Date: Wed, 9 Apr 2014 22:55:29 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Message-ID: <20140409205529.GO32103@quack.suse.cz> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> <20140408202102.GB5727@linux.intel.com> <20140409091450.GA32103@quack.suse.cz> <20140409151908.GD5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409151908.GD5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 09-04-14 11:19:08, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:14:50AM +0200, Jan Kara wrote: > > On Tue 08-04-14 16:21:02, Matthew Wilcox wrote: > > > On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote: > > > > > +static void dax_new_buf(void *addr, unsigned size, unsigned first, > > > > > + loff_t offset, loff_t end, int rw) > > > > > +{ > > > > > + loff_t final = end - offset + first; /* The final byte of the buffer */ > > > > > + if (rw != WRITE) { > > > > > + memset(addr, 0, size); > > > > > + return; > > > > > + } > > > > It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do > > > > this for unwritten blocks) when reading from them. Presumably it could also > > > > have undesired effects on endurance of persistent memory. 
Instead I'd expect > > > > that you simply zero out user provided buffer the same way as you do it for > > > > holes. > > > > > > I think we have to zero it here, because the second time we call > > > get_block() for a given block, it won't be BH_New any more, so we won't > > > know that it's supposed to be zeroed. > > But how can you have BH_New buffer when you didn't ask get_blocks() to > > create any block? That would be a bug in the get_blocks() implementation... > > Or am I missing something? > > Oh ... right. So just to be clear, we're looking at the case where > we're doing a read of a filesystem block which is BH_Unwritten, but > isn't a hole ... so it's been allocated on storage and not yet written. > That's already treated as a hole: > > if (rw == WRITE) { > ... > } else { > hole = !buffer_written(bh); > } > > and dax_new_buf is only called in the !hole case. Ah, my bad. But then dax_new_buf() won't ever be called for rw != WRITE. get_blocks() cannot ever return BH_New buffer when 'create' argument was 0. > > > > > + if ((flags & DIO_LOCKING) && (rw == READ)) { > > > > > + struct address_space *mapping = inode->i_mapping; > > > > > + mutex_lock(&inode->i_mutex); > > > > > + retval = filemap_write_and_wait_range(mapping, offset, end - 1); > > > > > + if (retval) { > > > > > + mutex_unlock(&inode->i_mutex); > > > > > + goto out; > > > > > + } > > > > Is there a reason for this? I'd assume DAX has no pages in pagecache... > > > > > > There will be pages in the page cache for holes that we page faulted on. > > > They must go! :-) > > Well, but this will only writeback dirty pages and if I read the code > > correctly those pages will never be dirty since dax_mkwrite() will replace > > them. Or am I missing something? > > In addition to writing back dirty pages, filemap_write_and_wait_range() > will evict clean pages. Unintuitive, I know, but it matches what the > direct I/O path does. 
Plus, if we fall back to buffered I/O for holes > (see above), then this will do the right thing at that time. Ugh, I'm pretty certain filemap_write_and_wait_range() doesn't evict anything ;). Direct IO path calls that function so that direct IO read after buffered write returns the written data. In that case we don't evict anything from page cache because direct IO read doesn't invalidate any information we have cached. Only direct IO write does that and for that we call invalidate_inode_pages2_range() after writing the pages. So I maintain that what you do doesn't make sense to me. You might need to do some invalidation of hole pages. But note that generic_file_direct_write() does that for you and even though that isn't serialized in any way with page faults which can instantiate the hole pages again, things should work out fine for you since that function also invalidates the range again after ->direct_IO callback is done. So AFAICT you don't have to do anything except writing some nice comment about this ;). 
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965089AbaDIVML (ORCPT ); Wed, 9 Apr 2014 17:12:11 -0400 Received: from cantor2.suse.de ([195.135.220.15]:36909 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965023AbaDIVMG (ORCPT ); Wed, 9 Apr 2014 17:12:06 -0400 Date: Wed, 9 Apr 2014 23:12:03 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140409211203.GP32103@quack.suse.cz> References: <20140408220525.GC26019@quack.suse.cz> <20140409204806.GF5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409204806.GF5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 09-04-14 16:48:06, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 12:05:25AM +0200, Jan Kara wrote: > > > + > > > + error = dax_get_pfn(inode, &bh, &pfn); > > > + if (error > 0) > > > + error = vm_insert_mixed(vma, vaddr, pfn); > > When there's a hole (thus page != NULL) and we are called from > > dax_mkwrite(), this will always return EBUSY, correct? > > Erm ... it will return -EBUSY if this was the task that previously > faulted on it. Drat. See below. > > > > + mutex_unlock(&mapping->i_mmap_mutex); > > > + > > > + if (page) { > > > + delete_from_page_cache(page); > > > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > > > + PAGE_CACHE_SIZE, 0); > > Here we unmap the PTE pointing to the hole page but then we'll have to > > retry the fault again to fill in the pfn we've got? This seems wrong. 
I'd > > say we want to remap the PTE from the hole page to a pfn we've got while > > holding i_mmap_mutex. remap_pfn_range() almost does what you need, except > > that you also need that to work for normal pages. So you might need to > > create a new helper in mm layer for that. > > I think it's easier than that. How does this look? > > @@ -390,9 +389,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v > dax_clear_blocks(inode, bh.b_blocknr, bh.b_size); > > error = dax_get_pfn(&bh, &pfn, blkbits); > - if (error > 0) > - error = vm_insert_mixed(vma, vaddr, pfn); > - mutex_unlock(&mapping->i_mmap_mutex); > + if (error <= 0) > + goto unlock; > > if (page) { > delete_from_page_cache(page); > @@ -402,6 +400,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v > page_cache_release(page); > } > > + error = vm_insert_mixed(vma, vaddr, pfn); > + mutex_unlock(&mapping->i_mmap_mutex); > + This would be fine except that unmap_mapping_range() grabs i_mmap_mutex again :-|. But it might be easier to provide a version of that function which assumes i_mmap_mutex is already locked than what I was suggesting. > if (error == -ENOMEM) > return VM_FAULT_OOM; > /* -EBUSY is fine, somebody else faulted on the same PTE */ > @@ -409,6 +410,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct v > BUG_ON(error); > return VM_FAULT_NOPAGE | major; > > + unlock: > + mutex_unlock(&mapping->i_mmap_mutex); > sigbus: > if (page) { > unlock_page(page); > > > > > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > > + get_block_t get_block) > > > +{ > > > + int result; > > > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > > + > > > + sb_start_pagefault(sb); > > You don't need any filesystem freeze protection for the fault handler > > since that's not going to modify the filesystem. > > Err ... we might allocate a block as a result of doing a write to a hole. > Or does that not count as 'modifying the filesystem' in this context? 
Ah, it does. But it would be nice to avoid doing sb_start_pagefault() if it's not a write fault - because you don't want to block reading from a frozen filesystem (imagine what would happen when you freeze your root filesystem to do a snapshot...). I have somewhat a mindset of standard pagecache mmap where filemap_fault() only reads in data regardless of FAULT_FLAG_WRITE setting so I was confused by your difference :). > > > + file_update_time(vma->vm_file); > > Why do you update m/ctime? We are only reading the file... > > ... except that it might be a write fault. I think we modify the file > iff we return VM_FAULT_MAJOR from do_dax_fault(). So I'd be open to > something like this: > > sb_start_pagefault(sb); > result = do_dax_fault(vma, vmf, get_block); > if (result & VM_FAULT_MAJOR) > file_update_time(vma->vm_file); > sb_end_pagefault(sb); > > Would that work better for you? Definitely. It's also a performance thing BTW - updating time stamps is relatively expensive for journalling filesystems - you have to start a transaction, add block with inode to the journal, stop a transaction - not something you want to do unless you have to. > > > @@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = { > > > #ifdef CONFIG_COMPAT > > > .compat_ioctl = ext2_compat_ioctl, > > > #endif > > > - .mmap = generic_file_mmap, > > > + .mmap = ext2_file_mmap, > > So what's the point of ext2_file_operations ever handling IS_DAX() > > inodes? Actually ext2_file_operations and ext2_xip_file_operations seem to > > be the same after this patch so either you drop ext2_xip_file_operations > > (I'm for this) or you can leave generic_file_mmap here and assume > > ext2_file_mmap is always called for IS_DAX() inodes. > > The goal is to get them the same. At this point, the only sticky point is: > > .splice_read = generic_file_splice_read, > .splice_write = generic_file_splice_write, > > And splice is pretty damn sticky for DAX. Yes, I have figured that out later. 
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965056AbaDIVnf (ORCPT ); Wed, 9 Apr 2014 17:43:35 -0400 Received: from cantor2.suse.de ([195.135.220.15]:37316 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933908AbaDIVne (ORCPT ); Wed, 9 Apr 2014 17:43:34 -0400 Date: Wed, 9 Apr 2014 23:43:31 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140409214331.GQ32103@quack.suse.cz> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409205111.GG5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 09-04-14 16:51:11, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 12:27:58PM +0200, Jan Kara wrote: > > > + if (unlikely(vmf->pgoff >= size)) { > > > + mutex_unlock(&mapping->i_mmap_mutex); > > > + goto sigbus; > > You need to release the block you've got from the filesystem in case of > > error here an below. > > What's the API to do that? Call inode->i_op->setattr()? That's a great question. Yes, ->setattr() is the only API you have for that but you cannot use that because of locking constraints (it needs i_mutex and that's not possible to get in the fault path). Let me read again what the handler does... 
So there are three places that can fail after we allocate the block:
1) We race with truncate reducing i_size
2) dax_get_pfn() fails
3) vm_insert_mixed() fails

I would guess that 2) can fail only if the HW has problems and leaking
block in that case could be acceptable (please correct me if I'm wrong).
3) shouldn't fail because of ENOMEM because fault has already allocated all
the page tables and EBUSY should be handled as well. So the only failure we
have to care about is 1). And we could move ->get_block() call under
i_mmap_mutex after the i_size check. Lock ordering should be fine because
i_mmap_mutex ranks above page lock under which we do block mapping in
standard ->page_mkwrite callbacks. The only (big) drawback is that
i_mmap_mutex will now be held for much longer time and thus the contention
would be much higher. But hopefully once we resolve our problems with
mmap_sem and introduce mapping range lock we could scale reasonably.

								Honza
-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935399AbaDJOQz (ORCPT ); Thu, 10 Apr 2014 10:16:55 -0400
Received: from mga02.intel.com ([134.134.136.20]:44835 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934295AbaDJOQx (ORCPT ); Thu, 10 Apr 2014 10:16:53 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,834,1389772800"; d="scan'208";a="518521340"
Date: Thu, 10 Apr 2014 10:16:30 -0400
From: Matthew Wilcox
To: Jan Kara
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks
Message-ID: <20140410141630.GH5727@linux.intel.com>
References: <20140409094644.GD32103@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140409094644.GD32103@quack.suse.cz>
User-Agent: Mutt/1.5.22 (2013-10-16)
Sender:
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 11:46:44AM +0200, Jan Kara wrote: > Another day, some more review ;) Comments below. I'm really grateful for all this review! It's killing me, though ;-) > > +int dax_clear_blocks(struct inode *inode, sector_t block, long size) > > +{ > > + struct block_device *bdev = inode->i_sb->s_bdev; > > + const struct block_device_operations *ops = bdev->bd_disk->fops; > > + sector_t sector = block << (inode->i_blkbits - 9); > > + unsigned long pfn; > > + > > + might_sleep(); > > + do { > > + void *addr; > > + long count = ops->direct_access(bdev, sector, &addr, &pfn, > > + size); > So do you assume blocksize == PAGE_SIZE here? If not, addr could be in > the middle of the page AFAICT. You're right. Depending on how clear_page() is implemented, that might go badly wrong. Of course, both ext2 & ext4 require block_size == PAGE_SIZE right now, so anything else is by definition untested. I've been trying to keep DAX free from that assumption, but obviously haven't caught all the places. How does this look? 

typedef long (*direct_access_t)(struct block_device *, sector_t, void **,
				unsigned long *pfn, long size);

int dax_clear_blocks(struct inode *inode, sector_t block, long size)
{
	struct block_device *bdev = inode->i_sb->s_bdev;
	direct_access_t direct_access = bdev->bd_disk->fops->direct_access;
	sector_t sector = block << (inode->i_blkbits - 9);
	unsigned long pfn;

	might_sleep();
	do {
		void *addr;
		long count = direct_access(bdev, sector, &addr, &pfn, size);
		if (count < 0)
			return count;
		while (count > 0) {
			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
			if (pgsz > count)
				pgsz = count;
			if (pgsz < PAGE_SIZE)
				memset(addr, 0, pgsz);
			else
				clear_page(addr);
			addr += pgsz;
			size -= pgsz;
			count -= pgsz;
			sector += pgsz / 512;
			cond_resched();
		}
	} while (size);

	return 0;
}
EXPORT_SYMBOL_GPL(dax_clear_blocks);

> >  		if (IS_DAX(inode)) {
> >  			/*
> > -			 * we need to clear the block
> > +			 * block must be initialised before we put it in the tree
> > +			 * so that it's not found by another thread before it's
> > +			 * initialised
> >  			 */
> > -			err = ext2_clear_xip_target (inode,
> > -					le32_to_cpu(chain[depth-1].key));
> > +			err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
> > +						count << inode->i_blkbits);
>   Umm 'count' looks wrong here. You want to clear only one block, don't
> you?

I think I got confused between ext2 and ext4 here.  I do want to clear
only one block.
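
[Editorial sketch, not part of the original message.] The chunking in the revised dax_clear_blocks() can be tried in user space. The code below is a minimal model under stated assumptions: the "device" is a plain byte buffer, the position is a byte offset rather than a kernel virtual address, and SKETCH_PAGE_SIZE, struct clear_stats and clear_range() are hypothetical names that do not exist in the kernel. It only shows how a range that starts mid-page is split into a partial head, whole pages, and a partial tail; where the kernel code would call clear_page(), the sketch simply counts a full-page chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/* Records which path each chunk took, so a caller can inspect the split. */
struct clear_stats {
	unsigned partial;	/* sub-page chunks (the memset() path) */
	unsigned full;		/* whole-page chunks (the clear_page() path) */
};

/*
 * Zero 'count' bytes of 'base' starting at byte offset 'start', never
 * letting a single chunk cross a page boundary.
 */
static void clear_range(unsigned char *base, size_t start, size_t count,
			struct clear_stats *st)
{
	while (count > 0) {
		size_t off = start & (SKETCH_PAGE_SIZE - 1);
		size_t chunk = SKETCH_PAGE_SIZE - off;

		if (chunk > count)
			chunk = count;
		assert(chunk > 0);

		memset(base + start, 0, chunk);
		if (chunk < SKETCH_PAGE_SIZE)
			st->partial++;
		else
			st->full++;

		start += chunk;
		count -= chunk;
	}
}
```

With a start offset of 1024 and a length of 7680 bytes, the loop would issue a 3072-byte head, one whole page, and a 512-byte tail. For the single-block case discussed in the reply above, the byte count passed in would presumably be one block (1 << i_blkbits) rather than count << i_blkbits.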

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965163AbaDJOYk (ORCPT ); Thu, 10 Apr 2014 10:24:40 -0400
Received: from mga09.intel.com ([134.134.136.24]:47568 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934688AbaDJOYh (ORCPT ); Thu, 10 Apr 2014 10:24:37 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.97,834,1389772800"; d="scan'208";a="518525868"
Date: Thu, 10 Apr 2014 10:23:54 -0400
From: Matthew Wilcox
To: Jan Kara
Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
Message-ID: <20140410142354.GJ5727@linux.intel.com>
References: <20140409095918.GG32103@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140409095918.GG32103@quack.suse.cz>
User-Agent: Mutt/1.5.22 (2013-10-16)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 09, 2014 at 11:59:18AM +0200, Jan Kara wrote:
> On Sun 23-03-14 15:08:41, Matthew Wilcox wrote:
> > The fewer Kconfig options we have the better.  Use the generic
> > CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.
> >
> > Signed-off-by: Matthew Wilcox
>   Looks good. You can add:
> Reviewed-by: Jan Kara
>
> BTW: It's really only 2KB of code?

I changed it in a later patch ... it's about 5kB of code.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935625AbaDJO2G (ORCPT ); Thu, 10 Apr 2014 10:28:06 -0400 Received: from mga09.intel.com ([134.134.136.24]:56466 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934688AbaDJO2E (ORCPT ); Thu, 10 Apr 2014 10:28:04 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,834,1389772800"; d="scan'208";a="490797977" Date: Thu, 10 Apr 2014 10:27:29 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ross Zwisler Subject: Re: [PATCH v7 18/22] xip: Add xip_zero_page_range Message-ID: <20140410142729.GL5727@linux.intel.com> References: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> <20140409101512.GL32103@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409101512.GL32103@quack.suse.cz> User-Agent: Mutt/1.5.22 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 12:15:12PM +0200, Jan Kara wrote: > > + /* > > + * ext4 sometimes asks to zero past the end of a block. It > > + * really just wants to zero to the end of the block. > > + */ > Then we should really fix ext4 I believe... Since I didn't want to do this ... > > +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */ > > +#define dax_truncate_page(inode, from, get_block) \ > > + dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block) > ^^^^ > This should be (PAGE_CACHE_SIZE - (from & (PAGE_CACHE_SIZE - 1))), shouldn't it? ... 
I could get away without doing that ;-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758735AbaDJSbJ (ORCPT ); Thu, 10 Apr 2014 14:31:09 -0400 Received: from cantor2.suse.de ([195.135.220.15]:56262 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753006AbaDJSbH (ORCPT ); Thu, 10 Apr 2014 14:31:07 -0400 Date: Thu, 10 Apr 2014 20:31:04 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Message-ID: <20140410183104.GA8060@quack.suse.cz> References: <20140409094644.GD32103@quack.suse.cz> <20140410141630.GH5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140410141630.GH5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 10-04-14 10:16:30, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:46:44AM +0200, Jan Kara wrote: > > Another day, some more review ;) Comments below. > > I'm really grateful for all this review! It's killing me, though ;-) Yeah, I know that feeling. :) > > > +int dax_clear_blocks(struct inode *inode, sector_t block, long size) > > > +{ > > > + struct block_device *bdev = inode->i_sb->s_bdev; > > > + const struct block_device_operations *ops = bdev->bd_disk->fops; > > > + sector_t sector = block << (inode->i_blkbits - 9); > > > + unsigned long pfn; > > > + > > > + might_sleep(); > > > + do { > > > + void *addr; > > > + long count = ops->direct_access(bdev, sector, &addr, &pfn, > > > + size); > > So do you assume blocksize == PAGE_SIZE here? If not, addr could be in > > the middle of the page AFAICT. > > You're right. 
Depending on how clear_page() is implemented, that > might go badly wrong. Of course, both ext2 & ext4 require block_size > == PAGE_SIZE right now, so anything else is by definition untested. > I've been trying to keep DAX free from that assumption, but obviously > haven't caught all the places. > > How does this look? That looks fine. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935455AbaDJSf3 (ORCPT ); Thu, 10 Apr 2014 14:35:29 -0400 Received: from cantor2.suse.de ([195.135.220.15]:56316 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934885AbaDJSf1 (ORCPT ); Thu, 10 Apr 2014 14:35:27 -0400 Date: Thu, 10 Apr 2014 20:35:26 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb() Message-ID: <20140410183526.GB8060@quack.suse.cz> References: <5f91cb658e1ee1b593be9fd719e8f204b0069031.1395591795.git.matthew.r.wilcox@intel.com> <20140409095254.GE32103@quack.suse.cz> <20140410142254.GI5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140410142254.GI5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 10-04-14 10:22:54, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:52:54AM +0200, Jan Kara wrote: > > > - if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) { > > > + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { > > > ext2_msg(sb, KERN_WARNING, "warning: refusing change of " > > > "xip flag with busy inodes while remounting"); > > > - sbi->s_mount_opt &= ~EXT2_MOUNT_XIP; > > > - sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP; > > > + sbi->s_mount_opt ^= 
EXT2_MOUNT_XIP; > > Although this is correct, it was easier to see that the previous code is > > correct so I'd prefer if you kept it that way. > > Depends how you think about it. I think of foo ^= bar as 'toggle the > bar bit in foo'. So I read the code as 'If the mount bit is incorrect, > print an error and toggle the bit'. I think you're reading the old code > as 'If the new mount bit differs from the old mount bit, make sure the > new mount bit is the same as the old mount bit'. Yeah, since it's pretty obvious what the code should do, one can figure out it is correct relatively quickly. But it's something that wasn't obvious to me at the first sight. If you really prefer your way, I can live with that... Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758908AbaDJSkP (ORCPT ); Thu, 10 Apr 2014 14:40:15 -0400 Received: from cantor2.suse.de ([195.135.220.15]:56383 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753322AbaDJSkM (ORCPT ); Thu, 10 Apr 2014 14:40:12 -0400 Date: Thu, 10 Apr 2014 20:40:10 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 17/22] Get rid of most mentions of XIP in ext2 Message-ID: <20140410184010.GC8060@quack.suse.cz> References: <0b13a744db9bfca33938bc1576f7eb7bfc9c41c2.1395591795.git.matthew.r.wilcox@intel.com> <20140409100435.GJ32103@quack.suse.cz> <20140410142625.GK5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140410142625.GK5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 10-04-14 10:26:25, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 12:04:35PM +0200, Jan Kara wrote: > > On 
Sun 23-03-14 15:08:43, Matthew Wilcox wrote: > > > The only remaining usage is userspace's 'xip' option. > > Looks good. You can add: > > Reviewed-by: Jan Kara > > I've been thinking about this patch, and I'm not happy with it any more :-) > > I want to migrate people away from using 'xip' to 'dax' without breaking > anybody's scripts. So I'm thinking about adding a new 'dax' option and > having the 'xip' option print a warning and force-enable the 'dax' option. > That way people who might have scripts to look for 'xip' in /proc/mounts > won't break. Yeah, that sounds reasonable. Maybe we could even show only 'dax' in /proc/mounts since I somewhat doubt there are any users who care. But showing also 'xip' when used is easy enough so why not. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935647AbaDJSnu (ORCPT ); Thu, 10 Apr 2014 14:43:50 -0400 Received: from cantor2.suse.de ([195.135.220.15]:56426 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933885AbaDJSns (ORCPT ); Thu, 10 Apr 2014 14:43:48 -0400 Date: Thu, 10 Apr 2014 20:43:46 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ross Zwisler Subject: Re: [PATCH v7 18/22] xip: Add xip_zero_page_range Message-ID: <20140410184346.GD8060@quack.suse.cz> References: <5a87acda8c3e4d2b7ea5dd1249fcbf8be23b9645.1395591795.git.matthew.r.wilcox@intel.com> <20140409101512.GL32103@quack.suse.cz> <20140410142729.GL5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140410142729.GL5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 10-04-14 10:27:29, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 
12:15:12PM +0200, Jan Kara wrote: > > > + /* > > > + * ext4 sometimes asks to zero past the end of a block. It > > > + * really just wants to zero to the end of the block. > > > + */ > > Then we should really fix ext4 I believe... > > Since I didn't want to do this ... > > > > +/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */ > > > +#define dax_truncate_page(inode, from, get_block) \ > > > + dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block) > > ^^^^ > > This should be (PAGE_CACHE_SIZE - (from & (PAGE_CACHE_SIZE - 1))), shouldn't it? > > ... I could get away without doing that ;-) I understand but ultimately the API is cleaner if it doesn't allow size past end of block. So IMHO we shouldn't introduce new places that call the function like this and we should fix places that do it now (make it WARN_ON_ONCE() and let ext4 guys do the work for you ;). Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754381AbaDMLVh (ORCPT ); Sun, 13 Apr 2014 07:21:37 -0400 Received: from mga11.intel.com ([192.55.52.93]:23732 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750800AbaDMLVf (ORCPT ); Sun, 13 Apr 2014 07:21:35 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,851,1389772800"; d="scan'208";a="519693502" Date: Sun, 13 Apr 2014 07:21:32 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140413112132.GP5727@linux.intel.com> References: <20140408220525.GC26019@quack.suse.cz> <20140409204806.GF5727@linux.intel.com> <20140409211203.GP32103@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409211203.GP32103@quack.suse.cz> User-Agent: 
Mutt/1.5.22 (2013-10-16)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 09, 2014 at 11:12:03PM +0200, Jan Kara wrote:
>   This would be fine except that unmap_mapping_range() grabs i_mmap_mutex
> again :-|. But it might be easier to provide a version of that function
> which assumes i_mmap_mutex is already locked than what I was suggesting.

*sigh*.  I knew that once ... which was why the call was after dropping
the lock.  OK, another try at fixing the problem; handle it down in the
insert_pfn code:

diff --git a/fs/dax.c b/fs/dax.c
index 6a8725b..2453025 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -390,7 +390,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	error = dax_get_pfn(&bh, &pfn, blkbits);
 	if (error > 0)
-		error = vm_insert_mixed(vma, vaddr, pfn);
+		error = vm_replace_mixed(vma, vaddr, pfn);
 	mutex_unlock(&mapping->i_mmap_mutex);
 
 	if (page) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ba72c54..df25410 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1944,8 +1944,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
-int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn);
+int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn, bool replace);
+#define vm_insert_mixed(vma, addr, pfn)	\
+	__vm_insert_mixed(vma, addr, pfn, false)
+#define vm_replace_mixed(vma, addr, pfn)	\
+	__vm_insert_mixed(vma, addr, pfn, true)
 int vm_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *,
 			unsigned long pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
diff --git a/mm/memory.c b/mm/memory.c
index 76fd657..ec59239 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2100,7 +2100,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
  * pages reserved for the old functions anyway.
  */
 static int insert_page(struct vm_area_struct *vma, unsigned long addr,
-			struct page *page, pgprot_t prot)
+			struct page *page, pgprot_t prot, bool replace)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -2116,8 +2116,12 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	if (!pte)
 		goto out;
 	retval = -EBUSY;
-	if (!pte_none(*pte))
-		goto out_unlock;
+	if (!pte_none(*pte)) {
+		if (!replace)
+			goto out_unlock;
+		VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
+		zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
+	}
 
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
@@ -2173,12 +2177,12 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 		BUG_ON(vma->vm_flags & VM_PFNMAP);
 		vma->vm_flags |= VM_MIXEDMAP;
 	}
-	return insert_page(vma, addr, page, vma->vm_page_prot);
+	return insert_page(vma, addr, page, vma->vm_page_prot, false);
 }
 EXPORT_SYMBOL(vm_insert_page);
 
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn, pgprot_t prot)
+			unsigned long pfn, pgprot_t prot, bool replace)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -2190,8 +2194,12 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	if (!pte)
 		goto out;
 	retval = -EBUSY;
-	if (!pte_none(*pte))
-		goto out_unlock;
+	if (!pte_none(*pte)) {
+		if (!replace)
+			goto out_unlock;
+		VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
+		zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
+	}
 
 	/* Ok, finally just insert the thing.. */
 	entry = pte_mkspecial(pfn_pte(pfn, prot));
@@ -2244,14 +2252,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	if (track_pfn_insert(vma, &pgprot, pfn))
 		return -EINVAL;
 
-	ret = insert_pfn(vma, addr, pfn, pgprot);
+	ret = insert_pfn(vma, addr, pfn, pgprot, false);
 
 	return ret;
 }
 EXPORT_SYMBOL(vm_insert_pfn);
 
-int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long pfn)
+int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn, bool replace)
 {
 	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
 
@@ -2269,11 +2277,11 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		struct page *page;
 
 		page = pfn_to_page(pfn);
-		return insert_page(vma, addr, page, vma->vm_page_prot);
+		return insert_page(vma, addr, page, vma->vm_page_prot, replace);
 	}
-	return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
+	return insert_pfn(vma, addr, pfn, vma->vm_page_prot, replace);
 }
-EXPORT_SYMBOL(vm_insert_mixed);
+EXPORT_SYMBOL(__vm_insert_mixed);
 
 static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, unsigned long pfn, pgprot_t prot)

> > > > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > > > +			get_block_t get_block)
> > > > +{
> > > > +	int result;
> > > > +	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > +
> > > > +	sb_start_pagefault(sb);
> > >   You don't need any filesystem freeze protection for the fault handler
> > > since that's not going to modify the filesystem.
> >
> > Err ... we might allocate a block as a result of doing a write to a hole.
> > Or does that not count as 'modifying the filesystem' in this context?
>   Ah, it does. But it would be nice to avoid doing sb_start_pagefault() if
> it's not a write fault - because you don't want to block reading from a
> frozen filesystem (imagine what would happen when you freeze your root
> filesystem to do a snapshot...).
>
> I have somewhat a mindset of standard pagecache mmap where filemap_fault()
> only reads in data regardless of FAULT_FLAG_WRITE setting so I was confused
> by your difference :).

Understood!  So this should work:

diff --git a/fs/dax.c b/fs/dax.c
index 2453025..e4d00fc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -431,10 +431,13 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int result;
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		sb_start_pagefault(sb);
+		file_update_time(vma->vm_file);
+	}
 	result = do_dax_fault(vma, vmf, get_block);
-	sb_end_pagefault(sb);
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		sb_end_pagefault(sb);
 
 	return result;
 }
@@ -453,15 +456,7 @@ EXPORT_SYMBOL_GPL(dax_fault);
 int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 			get_block_t get_block)
 {
-	int result;
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
-
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	result = do_dax_fault(vma, vmf, get_block);
-	sb_end_pagefault(sb);
-
-	return result;
+	return dax_fault(vma, vmf, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_mkwrite);
 
> > > > +	file_update_time(vma->vm_file);
> > >   Why do you update m/ctime? We are only reading the file...
> >
> > ... except that it might be a write fault.  I think we modify the file
> > iff we return VM_FAULT_MAJOR from do_dax_fault().  So I'd be open to
> > something like this:
> >
> > 	sb_start_pagefault(sb);
> > 	result = do_dax_fault(vma, vmf, get_block);
> > 	if (result & VM_FAULT_MAJOR)
> > 		file_update_time(vma->vm_file);
> > 	sb_end_pagefault(sb);
> >
> > Would that work better for you?
>   Definitely. It's also a performance thing BTW - updating time stamps is
> relatively expensive for journalling filesystems - you have to start a
> transaction, add block with inode to the journal, stop a transaction - not
> something you want to do unless you have to.

I realised that this isn't right. If you do a store to an mmaped file, you should update the timestamps, whether or not the fs had to allocate blocks. Hence the version above that only checks whether the fault is for write or not. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755243AbaDMSDe (ORCPT ); Sun, 13 Apr 2014 14:03:34 -0400 Received: from mga01.intel.com ([192.55.52.88]:14809 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754814AbaDMSDc (ORCPT ); Sun, 13 Apr 2014 14:03:32 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,852,1389772800"; d="scan'208";a="519803881" Date: Sun, 13 Apr 2014 14:03:29 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140413180329.GR5727@linux.intel.com> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409214331.GQ32103@quack.suse.cz> User-Agent: Mutt/1.5.22 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 11:43:31PM +0200, Jan Kara wrote: > On Wed 09-04-14 16:51:11, Matthew Wilcox wrote: > > On Wed, Apr 09, 2014 at 12:27:58PM +0200, Jan Kara wrote: > > > > + if (unlikely(vmf->pgoff >= size)) { > > > > + mutex_unlock(&mapping->i_mmap_mutex); > > > > + goto sigbus; > > > You need to release the block you've got from the filesystem in case of > > > error here an below. > > > > What's the API to do that? Call inode->i_op->setattr()? > That's a great question. 
Yes, ->setattr() is the only API you have for > that but you cannot use that because of locking constraints (it needs > i_mutex and that's not possible to get in the fault path). Let me read > again what the handler does... > > So there are three places that can fail after we allocate the block: > 1) We race with truncate reducing i_size > 2) dax_get_pfn() fails > 3) vm_insert_mixed() fails > > I would guess that 2) can fail only if the HW has problems and leaking > block in that case could be acceptable (please correct me if I'm wrong). > 3) shouldn't fail because of ENOMEM because fault has already allocated all > the page tables and EBUSY should be handled as well. So the only failure we > have to care about is 1). And we could move ->get_block() call under > i_mmap_mutex after the i_size check. Lock ordering should be fine because > i_mmap_mutex ranks above page lock under which we do block mapping in > standard ->page_mkwrite callbacks. The only (big) drawback is that > i_mmap_mutex will now be held for much longer time and thus the contention > would be much higher. But hopefully once we resolve our problems with > mmap_sem and introduce mapping range lock we could scale reasonably. I think you're right about the only failure case to worry about being (1). For 2 or 3, we haven't *leaked* the block, we've merely allocated it, found out we couldn't use it, and then not freed it. It'll be freed when the file is deleted or truncated. Taking the i_mmap_mutex earlier looks reasonable. I'll do that. As far as reducing contention on i_mmap_mutex goes, I'm currently planning on using an exceptional entry in the radix tree, designating one bit of that as the lock bit and using the remaining 29 / 61 bits to cache the PFN. That lock would then have the same rank as the page lock. It might be interesting to build that kind of 'locking' into the radix tree ... I'm half-thinking about taking a lock higher in the radix tree to cover large pages. 
I'll probably just use the lock bit in the entry that would cover the head page, though. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755369AbaDMSF5 (ORCPT ); Sun, 13 Apr 2014 14:05:57 -0400 Received: from mga11.intel.com ([192.55.52.93]:12450 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755148AbaDMSFz (ORCPT ); Sun, 13 Apr 2014 14:05:55 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,852,1389772800"; d="scan'208";a="519804655" Date: Sun, 13 Apr 2014 14:05:52 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O Message-ID: <20140413180552.GS5727@linux.intel.com> References: <3ebe329d8713f7db4c105021a845316a47a29797.1395591795.git.matthew.r.wilcox@intel.com> <20140408175600.GE2713@quack.suse.cz> <20140408202102.GB5727@linux.intel.com> <20140409091450.GA32103@quack.suse.cz> <20140409151908.GD5727@linux.intel.com> <20140409205529.GO32103@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409205529.GO32103@quack.suse.cz> User-Agent: Mutt/1.5.22 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 10:55:29PM +0200, Jan Kara wrote: > > In addition to writing back dirty pages, filemap_write_and_wait_range() > > will evict clean pages. Unintuitive, I know, but it matches what the > > direct I/O path does. Plus, if we fall back to buffered I/O for holes > > (see above), then this will do the right thing at that time. > Ugh, I'm pretty certain filemap_write_and_wait_range() doesn't evict > anything ;). Direct IO path calls that function so that direct IO read > after buffered write returns the written data. 
In that case we don't evict > anything from page cache because direct IO read doesn't invalidate any > information we have cached. Only direct IO write does that and for that we > call invalidate_inode_pages2_range() after writing the pages. So I maintain > that what you do doesn't make sense to me. You might need to do some > invalidation of hole pages. But note that generic_file_direct_write() does > that for you and even though that isn't serialized in any way with page > faults which can instantiate the hole pages again, things should work out > fine for you since that function also invalidates the range again after > ->direct_IO callback is done. So AFAICT you don't have to do anything > except writing some nice comment about this ;). You're right. I'm not sure what I got confused with there. I don't think there's a race I need to worry about ... even if another page gets instantiated (consider one thread furiously loading from a hole as fast as it can while another thread does a write), we'll shoot it down again. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751369AbaDMWhL (ORCPT ); Sun, 13 Apr 2014 18:37:11 -0400 Received: from mga02.intel.com ([134.134.136.20]:58613 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750958AbaDMWhJ (ORCPT ); Sun, 13 Apr 2014 18:37:09 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,853,1389772800"; d="scan'208";a="492293363" Date: Sun, 13 Apr 2014 15:07:21 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page Message-ID: <20140413190721.GA21460@linux.intel.com> References: <20140408221759.GD26019@quack.suse.cz> <20140409092635.GB32103@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409092635.GB32103@quack.suse.cz> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 11:26:35AM +0200, Jan Kara wrote: > I thought about this for a while and classical IO, truncation etc. could > easily work for blocksize < pagesize. And for mmap() you could just use > pagecache. Not sure if it's worth the complications though. Anyway we > should decide whether we don't care about blocksize < PAGE_CACHE_SIZE at > all, or whether we try to make things which can work reasonably easily > functional. In that case dax_truncate_page() needs some tweaking because it > currently assumes blocksize == PAGE_CACHE_SIZE. I think it actually assumes that blocksize <= PAGE_CACHE_SIZE in that it doesn't contain a loop to iterate over all blocks. It wouldn't be hard to fix but I'll just put in a comment noting what needs to be fixed ... 
I don't think there's going to be a lot of enthusiasm for adding support for blocksize != PAGE_SIZE / PAGE_CACHE_SIZE. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754860AbaDNQFF (ORCPT ); Mon, 14 Apr 2014 12:05:05 -0400 Received: from cantor2.suse.de ([195.135.220.15]:37503 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753306AbaDNQFB (ORCPT ); Mon, 14 Apr 2014 12:05:01 -0400 Date: Mon, 14 Apr 2014 18:04:57 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140414160457.GB13860@quack.suse.cz> References: <20140408220525.GC26019@quack.suse.cz> <20140409204806.GF5727@linux.intel.com> <20140409211203.GP32103@quack.suse.cz> <20140413112132.GP5727@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140413112132.GP5727@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 13-04-14 07:21:32, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:12:03PM +0200, Jan Kara wrote: > > This would be fine except that unmap_mapping_range() grabs i_mmap_mutex > > again :-|. But it might be easier to provide a version of that function > > which assumes i_mmap_mutex is already locked than what I was suggesting. > > *sigh*. I knew that once ... which was why the call was after dropping > the lock. OK, another try at fixing the problem; handle it down in the > insert_pfn code: OK, that change looks OK to me (although you might want to introduce vm_replace_mixed() in a separate patch). 
> > > > > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > > > > + get_block_t get_block) > > > > > +{ > > > > > + int result; > > > > > + struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > > > > + > > > > > + sb_start_pagefault(sb); > > > > You don't need any filesystem freeze protection for the fault handler > > > > since that's not going to modify the filesystem. > > > > > > Err ... we might allocate a block as a result of doing a write to a hole. > > > Or does that not count as 'modifying the filesystem' in this context? > > Ah, it does. But it would be nice to avoid doing sb_start_pagefault() if > > it's not a write fault - because you don't want to block reading from a > > frozen filesystem (imagine what would happen when you freeze your root > > filesystem to do a snapshot...). > > > > I have somewhat a mindset of standard pagecache mmap where filemap_fault() > > only reads in data regardless of FAULT_FLAG_WRITE setting so I was confused > > by your difference :). > > Understood! So this should work: > > diff --git a/fs/dax.c b/fs/dax.c > index 2453025..e4d00fc 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -431,10 +431,13 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > int result; > struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > - sb_start_pagefault(sb); > - file_update_time(vma->vm_file); > + if (vmf->flags & FAULT_FLAG_WRITE) { > + sb_start_pagefault(sb); > + file_update_time(vma->vm_file); > + } Yup, this looks good to me. Later if we find file_update_time() is slowing down faults too much, we can defer the actual update to msync() / close() time (POSIX actually allows that). But that's definitely for future. 
> result = do_dax_fault(vma, vmf, get_block); > - sb_end_pagefault(sb); > + if (vmf->flags & FAULT_FLAG_WRITE) > + sb_end_pagefault(sb); > > return result; > } Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751463AbaERO6W (ORCPT ); Sun, 18 May 2014 10:58:22 -0400 Received: from mail-wg0-f48.google.com ([74.125.82.48]:63683 "EHLO mail-wg0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751211AbaERO6U (ORCPT ); Sun, 18 May 2014 10:58:20 -0400 Message-ID: <5378CA88.3080105@gmail.com> Date: Sun, 18 May 2014 17:58:16 +0300 From: Boaz Harrosh User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0 MIME-Version: 1.0 To: Matthew Wilcox , linux-kernel@vger.kernel.org, Sagi Manole CC: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, willy@linux.intel.com Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs References: In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/23/2014 09:08 PM, Matthew Wilcox wrote: > One of the primary uses for NV-DIMMs is to expose them as a block device > and use a filesystem to store files on the NV-DIMM. While that works, > it currently wastes memory and CPU time buffering the files in the page > cache. We have support in ext2 for bypassing the page cache, but it > has some races which are unfixable in the current design. This series > of patches rewrite the underlying support, and add support for direct > access to ext4. > > This iteration of the patchset rebases to Linus' 3.14-rc7 (plus Kirill's > patches in linux-next http://marc.info/?l=linux-mm&m=139206489208546&w=2) Hi Matthew We are experimenting with NV-DIMMs. 
The experiment will use its own FS not based on ext4 at all, more like the infamous PMFS, but we want to start DAX based and not current XIP based. We want to make sure the proposed new API can be utilized stand-alone and there are no extX based assumptions. (Like the need for direct directory access instead of the ext4 copy-from-nvdimm-to-ram directory) Could you please put these patches on a public tree somewhere, or perhaps some later version, that I can pull directly from? This would help a lot. These patches are a bit hard to apply because it is not clear which of Kirill's patches I need. I tried some linux-next version around 3.14-rc7 that also includes Kirill's patches but it looks like there was further work done than your base. I was able to produce a tree with V6 of your patches but I would hate to do that manual work yet again. (Any linux base is fine, just so that I can pull it) Thanks Also I'm curious. I see you guys were working on PMFS for a while fixing and enhancing stuff. Then development stopped and these DAX patches started showing. Now, PMFS is based on current XIP (I was able to easily port it to 3.14-rc7). Do you guys have an internal attempt to port PMFS to DAX? (We might do it in future just as an exercise to get intimate with DAX and to make sure nothing is missing.) What are your plans with PMFS? Is it dead?
Good day Boaz > and fixes several bugs: > > - Initialise cow_page in do_page_mkwrite() (Matthew Wilcox) > - Clear new or unwritten blocks in page fault handler (Matthew Wilcox) > - Only call get_block when necessary (Matthew Wilcox) > - Reword Kconfig options (Matthew Wilcox / Vishal Verma) > - Fix a race between page fault and truncate (Matthew Wilcox) > - Fix a race between fault-for-read and fault-for-write (Matthew Wilcox) > - Zero the correct bytes in dax_new_buf() (Toshi Kani) > - Add DIO_LOCKING to an invocation of dax_do_io in ext4 (Ross Zwisler) > > Relative to the last patchset, I folded the 'Add reporting of major faults' > patch into the patch that adds the DAX page fault handler. > > The v6 patchset had seven additional xfstests failures. This patchset > now passes approximately as many xfstests as ext4 does on a ramdisk. > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752379AbaERXYE (ORCPT ); Sun, 18 May 2014 19:24:04 -0400 Received: from mga01.intel.com ([192.55.52.88]:33413 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752100AbaERXYC (ORCPT ); Sun, 18 May 2014 19:24:02 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.98,864,1392192000"; d="scan'208";a="533877626" Date: Sun, 18 May 2014 19:24:03 -0400 From: Matthew Wilcox To: Boaz Harrosh Cc: Matthew Wilcox , linux-kernel@vger.kernel.org, Sagi Manole , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs Message-ID: <20140518232403.GF6121@linux.intel.com> References: <5378CA88.3080105@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5378CA88.3080105@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, May 18, 2014 at 05:58:16PM +0300, Boaz Harrosh wrote: > We are 
experimenting with NV-DIMMs. The experiment will use its own > FS not based on ext4 at all, more like the infamous PMFS but we want > to start DAX based and not current XIP based. We want to make sure the proposed > new API can be utilized stand alone and there are no extX based assumptions. > (Like the need for direct directory access instead of the ext4 > copy-from-nvdimm-to-ram directory) Hi Boaz, Best of luck with your new filesystem. > Could you please put these patches on a public tree somewhere, or perhaps some > later version, that I can pull directly from? This would help a lot. I'm preparing a v8 right now; it will probably be available by the end of the week. > Also I'm curious. I see you guys were working on PMFS for a while > fixing and enhancing stuff. Then development stopped and these DAX > patches started showing. Now, PMFS is based on current XIP (I was able > to easily port it to 3.14-rc7). Do you guys have an internal attempt > to port PMFS to DAX? (We might do it in future just as an exercise > to get intimate with DAX and to make sure nothing is missing.) > What are your plans with PMFS? Is it dead? My group has no plans to do any more work with PMFS, and I'm not aware of anyone else planning on turning PMFS into a production-quality filesystem. But the code is out there and we can't stop anybody else from working on it. PMFS uses neither DAX nor XIP; it doesn't sit on top of a block device. We would probably have moved it to sit on top of a block device by now had we been developing it further.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753346AbaEUUoU (ORCPT ); Wed, 21 May 2014 16:44:20 -0400 Received: from g2t1383g.austin.hp.com ([15.217.136.92]:16353 "EHLO g2t1383g.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752629AbaEUUoR (ORCPT ); Wed, 21 May 2014 16:44:17 -0400 Message-ID: <1400704507.18128.23.camel@misato.fc.hp.com> Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler From: Toshi Kani To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Date: Wed, 21 May 2014 14:35:07 -0600 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.8.5 (3.8.5-2.fc19) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, 2014-03-23 at 15:08 -0400, Matthew Wilcox wrote: : > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > + get_block_t get_block) > +{ : > + error = dax_get_pfn(inode, &bh, &pfn); > + if (error > 0) > + error = vm_insert_mixed(vma, vaddr, pfn); > + mutex_unlock(&mapping->i_mmap_mutex); > + > + if (page) { > + delete_from_page_cache(page); > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > + PAGE_CACHE_SIZE, 0); > + unlock_page(page); > + page_cache_release(page); Hi Matthew, I am seeing a problem in this code path, where it deletes a page cache page mapped to a hole. Sometimes, page->_mapcount is 0, not -1, which leads __delete_from_page_cache(), called from delete_from_page_cache(), to hit the following BUG_ON. BUG_ON(page_mapped(page)) I suppose such page has a shared mapping. Does this code need to take care of replacing shared mappings in such case? 
Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752703AbaFEWrt (ORCPT ); Thu, 5 Jun 2014 18:47:49 -0400 Received: from g2t1383g.austin.hp.com ([15.217.136.92]:48820 "EHLO g2t1383g.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751651AbaFEWrr (ORCPT ); Thu, 5 Jun 2014 18:47:47 -0400 Message-ID: <1402007914.7963.8.camel@misato.fc.hp.com> Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler From: Toshi Kani To: Matthew Wilcox Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, willy@linux.intel.com Date: Thu, 05 Jun 2014 16:38:34 -0600 In-Reply-To: <1400704507.18128.23.camel@misato.fc.hp.com> References: <1400704507.18128.23.camel@misato.fc.hp.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.8.5 (3.8.5-2.fc19) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2014-05-21 at 14:35 -0600, Toshi Kani wrote: > On Sun, 2014-03-23 at 15:08 -0400, Matthew Wilcox wrote: > : > > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > + get_block_t get_block) > > +{ > : > > + error = dax_get_pfn(inode, &bh, &pfn); > > + if (error > 0) > > + error = vm_insert_mixed(vma, vaddr, pfn); > > + mutex_unlock(&mapping->i_mmap_mutex); > > + > > + if (page) { > > + delete_from_page_cache(page); > > + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, > > + PAGE_CACHE_SIZE, 0); > > + unlock_page(page); > > + page_cache_release(page); > > Hi Matthew, > > I am seeing a problem in this code path, where it deletes a page cache > page mapped to a hole. Sometimes, page->_mapcount is 0, not -1, which > leads __delete_from_page_cache(), called from delete_from_page_cache(), > to hit the following BUG_ON. 
> > BUG_ON(page_mapped(page)) > > I suppose such page has a shared mapping. Does this code need to take > care of replacing shared mappings in such case? Hi Matthew, The following change works in my environment. What do you think? Thanks, -Toshi --- fs/dax.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/dax.c b/fs/dax.c index 2d6b4bc..046c6d6 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -26,6 +26,7 @@ #include #include #include +#include int dax_clear_blocks(struct inode *inode, sector_t block, long size) { @@ -385,6 +386,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, mutex_unlock(&mapping->i_mmap_mutex); if (page) { + if (page_mapped(page)) + try_to_unmap(page, TTU_UNMAP|TTU_IGNORE_ACCESS); delete_from_page_cache(page); unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, PAGE_CACHE_SIZE, 0); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933101AbaFQSLx (ORCPT ); Tue, 17 Jun 2014 14:11:53 -0400 Received: from mail-pa0-f54.google.com ([209.85.220.54]:47489 "EHLO mail-pa0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932135AbaFQSLv (ORCPT ); Tue, 17 Jun 2014 14:11:51 -0400 Message-ID: <53A084E3.6080103@gmail.com> Date: Tue, 17 Jun 2014 21:11:47 +0300 From: Boaz Harrosh User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org CC: willy@linux.intel.com Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs References: In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/23/2014 09:08 PM, Matthew Wilcox wrote: > One of the primary uses for NV-DIMMs is to expose them as a block device > and use a filesystem to store files on the NV-DIMM. 
While that works, > it currently wastes memory and CPU time buffering the files in the page > cache. We have support in ext2 for bypassing the page cache, but it > has some races which are unfixable in the current design. This series > of patches rewrite the underlying support, and add support for direct > access to ext4. > > This iteration of the patchset rebases to Linus' 3.14-rc7 (plus Kirill's > patches in linux-next http://marc.info/?l=linux-mm&m=139206489208546&w=2) > and fixes several bugs: > > - Initialise cow_page in do_page_mkwrite() (Matthew Wilcox) > - Clear new or unwritten blocks in page fault handler (Matthew Wilcox) > - Only call get_block when necessary (Matthew Wilcox) > - Reword Kconfig options (Matthew Wilcox / Vishal Verma) > - Fix a race between page fault and truncate (Matthew Wilcox) > - Fix a race between fault-for-read and fault-for-write (Matthew Wilcox) > - Zero the correct bytes in dax_new_buf() (Toshi Kani) > - Add DIO_LOCKING to an invocation of dax_do_io in ext4 (Ross Zwisler) > > Relative to the last patchset, I folded the 'Add reporting of major faults' > patch into the patch that adds the DAX page fault handler. > > The v6 patchset had seven additional xfstests failures. This patchset > now passes approximately as many xfstests as ext4 does on a ramdisk. 
> > Matthew Wilcox (21): > Fix XIP fault vs truncate race > Allow page fault handlers to perform the COW > axonram: Fix bug in direct_access > Change direct_access calling convention > Introduce IS_DAX(inode) > Replace XIP read and write with DAX I/O > Replace the XIP page fault handler with the DAX page fault handler > Replace xip_truncate_page with dax_truncate_page > Remove mm/filemap_xip.c > Remove get_xip_mem > Replace ext2_clear_xip_target with dax_clear_blocks > ext2: Remove ext2_xip_verify_sb() > ext2: Remove ext2_use_xip > ext2: Remove xip.c and xip.h > Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX > ext2: Remove ext2_aops_xip > Get rid of most mentions of XIP in ext2 > xip: Add xip_zero_page_range > ext4: Make ext4_block_zero_page_range static > ext4: Fix typos > brd: Rename XIP to DAX Hi Matthew I have some more trouble with DAX (and old XIP) please forgive me if I'm just senile and clueless. And put some sense into me. The title of this patchset is "ext4 on NV-DIMMs" But all I see is that DAX (and old XIP) is supported by mounting over brd devices. (On x86 I'm not sure about the other drivers) But looking to use brd with real NV_DIMMS fails miserably. (I'm talking about the RAM based NV_DIMMS (backed by flash) and not about the block based Diablo DDR bus flash devices type) Looking at the brd code I fail to see how it will ever support NV_DIMMS. brd is "struct page" based and shares RAM from the same memory pool as the rest of the system. But NV_DIMMS is not page-based and is excluded from the memory system. It needs to be exclusively owned by a device and the mounted FS. We currently have in our lab the old DDR3 based NV_DIMMS and on regular boot it appears as RAM. We need to use memmap= option on command line of Kernel to exclude it from use by Kernel. We have received our DDR4 based NV_DIMMS but still waiting for the actual system board to support it. 
As I understand from STD documentation these devices will not identify as RAM and will be exported as ACPI or SBUS devices that can be queried for size and address as well as properties about the chips. So I imagine a udev rule will need to probe the right driver to mount over those. So currently from what I can see only the infamous PMFS is the setup that can actually mount/support my NV_DIMMS today. It seems to me like we need a *new* block device that receives, like PMFS, a physical_address + size on load and will export this raw region as a block device, with support for the new DAX API of course. Should I send in code for such a device? (I've seen the linux-nvdimm project on GitHub but did not see how my above problem is addressed; it looks geared toward that other type of DDR bus device) So please, how is all that supposed to work, what is the strategy stack for all this? I guess for now I'm stuck with PMFS. (BTW: A public git tree of DAX patches ;-) ) Thanks Boaz > > Ross Zwisler (1): > ext4: Add DAX functionality > > Documentation/filesystems/Locking | 3 - > Documentation/filesystems/dax.txt | 84 ++++++ > Documentation/filesystems/ext4.txt | 2 + > Documentation/filesystems/xip.txt | 68 ----- > arch/powerpc/sysdev/axonram.c | 8 +- > drivers/block/Kconfig | 13 +- > drivers/block/brd.c | 22 +- > drivers/s390/block/dcssblk.c | 19 +- > fs/Kconfig | 21 +- > fs/Makefile | 1 + > fs/dax.c | 509 +++++++++++++++++++++++++++++++++++++ > fs/exofs/inode.c | 1 - > fs/ext2/Kconfig | 11 - > fs/ext2/Makefile | 1 - > fs/ext2/ext2.h | 9 +- > fs/ext2/file.c | 45 +++- > fs/ext2/inode.c | 37 +-- > fs/ext2/namei.c | 13 +- > fs/ext2/super.c | 48 ++-- > fs/ext2/xip.c | 91 ------- > fs/ext2/xip.h | 26 -- > fs/ext4/ext4.h | 8 +- > fs/ext4/file.c | 53 +++- > fs/ext4/indirect.c | 19 +- > fs/ext4/inode.c | 94 ++++--- > fs/ext4/namei.c | 10 +- > fs/ext4/super.c | 39 ++- > fs/open.c | 5 +- > include/linux/blkdev.h | 4 +- > include/linux/fs.h | 49 +++- > include/linux/mm.h | 2 + > mm/Makefile | 1 - >
mm/fadvise.c | 6 +- > mm/filemap.c | 6 +- > mm/filemap_xip.c | 483 ----------------------------------- > mm/madvise.c | 2 +- > mm/memory.c | 45 +++- > 37 files changed, 984 insertions(+), 874 deletions(-) > create mode 100644 Documentation/filesystems/dax.txt > delete mode 100644 Documentation/filesystems/xip.txt > create mode 100644 fs/dax.c > delete mode 100644 fs/ext2/xip.c > delete mode 100644 fs/ext2/xip.h > delete mode 100644 mm/filemap_xip.c > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756177AbaFQSjK (ORCPT ); Tue, 17 Jun 2014 14:39:10 -0400 Received: from mail-pb0-f52.google.com ([209.85.160.52]:46967 "EHLO mail-pb0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754403AbaFQSjI (ORCPT ); Tue, 17 Jun 2014 14:39:08 -0400 Message-ID: <53A08B47.3010701@gmail.com> Date: Tue, 17 Jun 2014 21:39:03 +0300 From: Boaz Harrosh User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Matthew Wilcox CC: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 00/22] Support ext4 on NV-DIMMs References: <53A084E3.6080103@gmail.com> <20140617181925.GF12025@linux.intel.com> In-Reply-To: <20140617181925.GF12025@linux.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/17/2014 09:19 PM, Matthew Wilcox wrote: > On Tue, Jun 17, 2014 at 09:11:47PM +0300, Boaz Harrosh wrote: > > https://github.com/01org/prd should sort you out with both a git tree > and a new block driver. You'll need to tell it manually what address > range to use. I'm using it against regular DIMMs, and this works pretty > well for me since my BIOS doesn't zero DRAM on reset. > God Yes exactly my missing link, Thanks. How I failed to find it? 
Yes, for us too: the BIOS doesn't zero DRAM, and we can use it by passing memmap= on kernel boot. Please include the above link in the new patchset and Documentation, just to make the overall picture clearer. BTW, what prevents submitting this prd driver upstream right now? There are devices out there that will need it, no? Even for something as simple and very smart as putting my ext4 or xfs journal device on an NV-DIMM, no? The "tell it manually what address range to use" part is fine in my book. A user-mode udev rule can then be used to cover the gap from sbus or acpi to prd. Hey, actually this tree has everything I need. Thanks, man. Boaz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753733AbaG2MNF (ORCPT ); Tue, 29 Jul 2014 08:13:05 -0400 Received: from mga03.intel.com ([143.182.124.21]:33990 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753317AbaG2MNC (ORCPT ); Tue, 29 Jul 2014 08:13:02 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.01,756,1400050800"; d="scan'208";a="462656253" Date: Tue, 29 Jul 2014 08:12:59 -0400 From: Matthew Wilcox To: Jan Kara Cc: Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140729121259.GL6754@linux.intel.com> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140409214331.GQ32103@quack.suse.cz> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 09, 2014 at 11:43:31PM +0200, Jan Kara wrote: > So there are three places that can fail after we allocate the block: > 1) We race with truncate reducing i_size > 2)
dax_get_pfn() fails > 3) vm_insert_mixed() fails > > I would guess that 2) can fail only if the HW has problems and leaking > block in that case could be acceptable (please correct me if I'm wrong). > 3) shouldn't fail because of ENOMEM because fault has already allocated all > the page tables and EBUSY should be handled as well. So the only failure we > have to care about is 1). And we could move ->get_block() call under > i_mmap_mutex after the i_size check. Lock ordering should be fine because > i_mmap_mutex ranks above page lock under which we do block mapping in > standard ->page_mkwrite callbacks. The only (big) drawback is that > i_mmap_mutex will now be held for much longer time and thus the contention > would be much higher. But hopefully once we resolve our problems with > mmap_sem and introduce mapping range lock we could scale reasonably. Lockdep barfs on holding i_mmap_mutex while calling ext4's ->get_block. Path 1: ext4_fallocate -> ext4_punch_hole -> ext4_inode_attach_jinode() -> ... -> lock_map_acquire(&handle->h_lockdep_map); truncate_pagecache_range() -> unmap_mapping_range() -> mutex_lock(&mapping->i_mmap_mutex); Path 2: do_dax_fault() -> mutex_lock(&mapping->i_mmap_mutex); ext4_get_block() -> ... -> lock_map_acquire(&handle->h_lockdep_map); So that idea doesn't work. We can't exclude truncates by incrementing i_dio_count, because we can't take i_mutex in the fault path. I'm stumped. 
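The inversion reported above can be illustrated outside the kernel. What follows is a minimal user-space sketch, not lockdep itself and not code from this thread: it records "B was acquired while A was held" edges between lock classes and flags a new edge that would close a cycle, which is roughly the kind of check that produces a report like the one described. The names lock_class, record_dependency(), reset_graph() and check_dax_fault_paths() are hypothetical and exist only for this example; the two class names are borrowed from the paths quoted above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical lock classes, named after the two locks in the paths above. */
enum lock_class { I_MMAP_MUTEX, JBD2_HANDLE, NR_CLASSES };

/* edge[a][b] != 0 means "b was acquired while a was already held". */
static int edge[NR_CLASSES][NR_CLASSES];

/* Forget every recorded dependency. */
static void reset_graph(void)
{
	memset(edge, 0, sizeof(edge));
}

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static int reachable(enum lock_class from, enum lock_class to, int *seen)
{
	if (from == to)
		return 1;
	seen[from] = 1;
	for (int next = 0; next < NR_CLASSES; next++)
		if (edge[from][next] && !seen[next] &&
		    reachable((enum lock_class)next, to, seen))
			return 1;
	return 0;
}

/*
 * Record that 'wanted' is acquired while 'held' is held.
 * Returns 1 if some earlier path already ordered the two locks the
 * other way round (adding this edge closes a cycle), 0 otherwise.
 */
static int record_dependency(enum lock_class held, enum lock_class wanted)
{
	int seen[NR_CLASSES] = { 0 };
	int inversion;

	assert(held < NR_CLASSES && wanted < NR_CLASSES);
	inversion = reachable(wanted, held, seen);
	edge[held][wanted] = 1;
	return inversion;
}

/* Replay the two paths from the report; returns 1 if they conflict. */
static int check_dax_fault_paths(void)
{
	reset_graph();
	/* Path 1 (punch hole): handle started, then i_mmap_mutex taken. */
	if (record_dependency(JBD2_HANDLE, I_MMAP_MUTEX))
		return 1;
	/* Path 2 (DAX fault): i_mmap_mutex held, then handle started. */
	return record_dependency(I_MMAP_MUTEX, JBD2_HANDLE);
}
```

Calling check_dax_fault_paths() from a small main() replays both orderings; the first edge is accepted and the second is flagged, because the two paths take the same pair of locks in opposite orders.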
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754207AbaG2VFD (ORCPT ); Tue, 29 Jul 2014 17:05:03 -0400 Received: from cantor2.suse.de ([195.135.220.15]:53209 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751466AbaG2VFB (ORCPT ); Tue, 29 Jul 2014 17:05:01 -0400 Date: Tue, 29 Jul 2014 23:04:57 +0200 From: Jan Kara To: Matthew Wilcox Cc: Jan Kara , Matthew Wilcox , linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Message-ID: <20140729210457.GA17807@quack.suse.cz> References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140729121259.GL6754@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 29-07-14 08:12:59, Matthew Wilcox wrote: > On Wed, Apr 09, 2014 at 11:43:31PM +0200, Jan Kara wrote: > > So there are three places that can fail after we allocate the block: > > 1) We race with truncate reducing i_size > > 2) dax_get_pfn() fails > > 3) vm_insert_mixed() fails > > > > I would guess that 2) can fail only if the HW has problems and leaking > > block in that case could be acceptable (please correct me if I'm wrong). > > 3) shouldn't fail because of ENOMEM because fault has already allocated all > > the page tables and EBUSY should be handled as well. So the only failure we > > have to care about is 1). And we could move ->get_block() call under > > i_mmap_mutex after the i_size check. 
> > Lock ordering should be fine because
> > i_mmap_mutex ranks above page lock under which we do block mapping in
> > standard ->page_mkwrite callbacks. The only (big) drawback is that
> > i_mmap_mutex will now be held for much longer time and thus the contention
> > would be much higher. But hopefully once we resolve our problems with
> > mmap_sem and introduce mapping range lock we could scale reasonably.
>
> Lockdep barfs on holding i_mmap_mutex while calling ext4's ->get_block.
>
> Path 1:
>
> ext4_fallocate ->
> ext4_punch_hole ->
> ext4_inode_attach_jinode() -> ... ->
> lock_map_acquire(&handle->h_lockdep_map);
> truncate_pagecache_range() ->
> unmap_mapping_range() ->
> mutex_lock(&mapping->i_mmap_mutex);

This is strange. I don't see how ext4_inode_attach_jinode() can ever lead
to lock_map_acquire(&handle->h_lockdep_map). Can you post a full trace for
this?

> Path 2:
> do_dax_fault() ->
> mutex_lock(&mapping->i_mmap_mutex);
> ext4_get_block() -> ... ->
> lock_map_acquire(&handle->h_lockdep_map);

This is obviously correct.

Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752248AbaG3VCq (ORCPT ); Wed, 30 Jul 2014 17:02:46 -0400
Received: from mga09.intel.com ([134.134.136.24]:47152 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751708AbaG3VCn (ORCPT ); Wed, 30 Jul 2014 17:02:43 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.01,767,1400050800"; d="scan'208";a="581399893"
Date: Wed, 30 Jul 2014 17:02:40 -0400
From: Matthew Wilcox
To: Jan Kara
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Message-ID: <20140730210239.GS6754@linux.intel.com>
References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> <20140730095229.GA19205@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140730095229.GA19205@quack.suse.cz>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jul 30, 2014 at 11:52:29AM +0200, Jan Kara wrote:
> I see the problem now. How about an attached patch? Do you see other
> lockdep warnings with it?

This patch fixes the problem, thanks! Regardless of DAX, I think this
patch should be applied in order to avoid creating a dependency between
i_mmap_mutex and jbd2_handle. I've now run into a different problem
with COW pages ... more later.

> >From c01c905cf3c4c6304a5ea9836389d9cf0d575884 Mon Sep 17 00:00:00 2001
> From: Jan Kara
> Date: Wed, 30 Jul 2014 11:49:07 +0200
> Subject: [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and
>  transaction start
>
> When DAX is enabled, it uses i_mmap_mutex as a protection against
> truncate during page fault. This inevitably forces i_mmap_mutex to rank
> outside of a transaction start and thus we have to avoid calling
> pagecache purging operations when transaction is started.
>
> Signed-off-by: Jan Kara
> ---
>  fs/ext4/inode.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 8a064734e6eb..494a8645d63e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3631,13 +3631,19 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
>  	if (IS_SYNC(inode))
>  		ext4_handle_sync(handle);
>
> -	/* Now release the pages again to reduce race window */
> +	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> +	ext4_mark_inode_dirty(handle, inode);
> +	ext4_journal_stop(handle);
> +
> +	/*
> +	 * Now release the pages again to reduce race window. This has to happen
> +	 * outside of a transaction to avoid lock inversion on i_mmap_mutex
> +	 * when DAX is enabled.
> +	 */
>  	if (last_block_offset > first_block_offset)
>  		truncate_pagecache_range(inode, first_block_offset,
>  					 last_block_offset);
> -
> -	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> -	ext4_mark_inode_dirty(handle, inode);
> +	goto out_dio;
>  out_stop:
>  	ext4_journal_stop(handle);
>  out_dio:
> --
> 1.8.1.4
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751632AbaHILAH (ORCPT ); Sat, 9 Aug 2014 07:00:07 -0400
Received: from mga14.intel.com ([192.55.52.115]:13703 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751418AbaHILAB (ORCPT ); Sat, 9 Aug 2014 07:00:01 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.01,832,1400050800"; d="scan'208";a="574165397"
Date: Sat, 9 Aug 2014 07:00:00 -0400
From: Matthew Wilcox
To: Jan Kara
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Message-ID: <20140809110000.GA32313@linux.intel.com>
References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> <20140730095229.GA19205@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140730095229.GA19205@quack.suse.cz>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jul 30, 2014 at 11:52:29AM +0200, Jan Kara wrote:
> I see the problem now. How about an attached patch? Do you see other
> lockdep warnings with it?

Hit another one :-( Same inversion between i_mmap_mutex and jbd2_handle:

-> #1 (&mapping->i_mmap_mutex){+.+...}:
       [] lock_acquire+0xb2/0x1f0
       [] mutex_lock_nested+0x75/0x420
       [] rmap_walk+0x6f/0x390
       [] page_mkclean+0x69/0x90
       [] clear_page_dirty_for_io+0x60/0x120
       [] mpage_submit_page+0x47/0x80 [ext4]
       [] mpage_process_page_bufs+0x110/0x120 [ext4]
       [] mpage_prepare_extent_to_map+0x1f0/0x2f0 [ext4]
       [] ext4_writepages+0x427/0x1060 [ext4]
       [] do_writepages+0x21/0x40
       [] __filemap_fdatawrite_range+0x59/0x60
       [] filemap_write_and_wait_range+0x2d/0x70
       [] ext4_sync_file+0x118/0x490 [ext4]
       [] vfs_fsync_range+0x1b/0x30
       [] SyS_msync+0x1ed/0x250

(ext4_writepages starts a transaction before calling
mpage_prepare_extent_to_map)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753003AbaHKIvy (ORCPT ); Mon, 11 Aug 2014 04:51:54 -0400
Received: from cantor2.suse.de ([195.135.220.15]:32812 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752977AbaHKIvv (ORCPT ); Mon, 11 Aug 2014 04:51:51 -0400
Date: Mon, 11 Aug 2014 10:51:47 +0200
From: Jan Kara
To: Matthew Wilcox
Cc: Jan Kara, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Message-ID: <20140811085147.GB29526@quack.suse.cz>
References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> <20140730095229.GA19205@quack.suse.cz> <20140809110000.GA32313@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140809110000.GA32313@linux.intel.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List:
linux-kernel@vger.kernel.org

On Sat 09-08-14 07:00:00, Matthew Wilcox wrote:
> On Wed, Jul 30, 2014 at 11:52:29AM +0200, Jan Kara wrote:
> > I see the problem now. How about an attached patch? Do you see other
> > lockdep warnings with it?
>
> Hit another one :-( Same inversion between i_mmap_mutex and jbd2_handle:
>
> -> #1 (&mapping->i_mmap_mutex){+.+...}:
>        [] lock_acquire+0xb2/0x1f0
>        [] mutex_lock_nested+0x75/0x420
>        [] rmap_walk+0x6f/0x390
>        [] page_mkclean+0x69/0x90
>        [] clear_page_dirty_for_io+0x60/0x120
>        [] mpage_submit_page+0x47/0x80 [ext4]
>        [] mpage_process_page_bufs+0x110/0x120 [ext4]
>        [] mpage_prepare_extent_to_map+0x1f0/0x2f0 [ext4]
>        [] ext4_writepages+0x427/0x1060 [ext4]
>        [] do_writepages+0x21/0x40
>        [] __filemap_fdatawrite_range+0x59/0x60
>        [] filemap_write_and_wait_range+0x2d/0x70
>        [] ext4_sync_file+0x118/0x490 [ext4]
>        [] vfs_fsync_range+0x1b/0x30
>        [] SyS_msync+0x1ed/0x250
>
> (ext4_writepages starts a transaction before calling
> mpage_prepare_extent_to_map)

Hum, yes, this is difficult. Getting rid of clear_page_dirty_for_io()
when the transaction is started isn't easily possible :(.
So I'm afraid we'll have to find some other way to synchronize
page faults and truncate / punch hole in DAX.

Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753882AbaHKONM (ORCPT ); Mon, 11 Aug 2014 10:13:12 -0400
Received: from mga11.intel.com ([192.55.52.93]:61788 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753715AbaHKONK (ORCPT ); Mon, 11 Aug 2014 10:13:10 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.01,841,1400050800"; d="scan'208";a="583188451"
Date: Mon, 11 Aug 2014 10:13:08 -0400
From: Matthew Wilcox
To: Jan Kara
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Message-ID: <20140811141308.GZ6754@linux.intel.com>
References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> <20140730095229.GA19205@quack.suse.cz> <20140809110000.GA32313@linux.intel.com> <20140811085147.GB29526@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140811085147.GB29526@quack.suse.cz>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Aug 11, 2014 at 10:51:47AM +0200, Jan Kara wrote:
> So I'm afraid we'll have to find some other way to synchronize
> page faults and truncate / punch hole in DAX.

What if we don't? If we hit the race (which is vanishingly unlikely with
real applications), the consequence is simply that after a truncate, a
file may be left with one or two blocks allocated somewhere after i_size.
As I understand it, that's not a real problem; they're temporarily
unavailable for allocation but will be freed on file removal or the next
truncation of that file.

I'm also still considering the possibility of having truncate-down block
until all mmaps that extend after the new i_size have been removed ...

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753958AbaHKOfG (ORCPT ); Mon, 11 Aug 2014 10:35:06 -0400
Received: from cantor2.suse.de ([195.135.220.15]:42032 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753943AbaHKOfE (ORCPT ); Mon, 11 Aug 2014 10:35:04 -0400
Date: Mon, 11 Aug 2014 16:35:00 +0200
From: Jan Kara
To: Matthew Wilcox
Cc: Jan Kara, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Message-ID: <20140811143500.GF29526@quack.suse.cz>
References: <20140409102758.GM32103@quack.suse.cz> <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> <20140730095229.GA19205@quack.suse.cz> <20140809110000.GA32313@linux.intel.com> <20140811085147.GB29526@quack.suse.cz> <20140811141308.GZ6754@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140811141308.GZ6754@linux.intel.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 11-08-14 10:13:08, Matthew Wilcox wrote:
> On Mon, Aug 11, 2014 at 10:51:47AM +0200, Jan Kara wrote:
> > So I'm afraid we'll have to find some other way to synchronize
> > page faults and truncate / punch hole in DAX.
>
> What if we don't?
> If we hit the race (which is vanishingly unlikely with
> real applications), the consequence is simply that after a truncate, a
> file may be left with one or two blocks allocated somewhere after i_size.
> As I understand it, that's not a real problem; they're temporarily
> unavailable for allocation but will be freed on file removal or the next
> truncation of that file.

You mean if you won't have any locking between page fault and truncate?
You can have:
a) extending truncate making forgotten blocks with non-zeros visible
b) filesystem corruption due to doubly used blocks (block will be freed
from the truncated file and thus can be reallocated but it will still be
accessible via mmap from the truncated file).

So not a good idea.

> I'm also still considering the possibility of having truncate-down block
> until all mmaps that extend after the new i_size have been removed ...

Hum, I'm not sure how you would do that with current locking scheme and
wait for all page faults on that range to finish but maybe you have some
good idea :)

Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754003AbaHKPCK (ORCPT ); Mon, 11 Aug 2014 11:02:10 -0400
Received: from mga02.intel.com ([134.134.136.20]:41146 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753733AbaHKPCI (ORCPT ); Mon, 11 Aug 2014 11:02:08 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.01,842,1400050800"; d="scan'208";a="556884412"
Date: Mon, 11 Aug 2014 11:02:05 -0400
From: Matthew Wilcox
To: Jan Kara
Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Message-ID: <20140811150205.GA6754@linux.intel.com>
References: <20140409205111.GG5727@linux.intel.com> <20140409214331.GQ32103@quack.suse.cz>
<20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> <20140730095229.GA19205@quack.suse.cz> <20140809110000.GA32313@linux.intel.com> <20140811085147.GB29526@quack.suse.cz> <20140811141308.GZ6754@linux.intel.com> <20140811143500.GF29526@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140811143500.GF29526@quack.suse.cz>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Aug 11, 2014 at 04:35:00PM +0200, Jan Kara wrote:
> On Mon 11-08-14 10:13:08, Matthew Wilcox wrote:
> > On Mon, Aug 11, 2014 at 10:51:47AM +0200, Jan Kara wrote:
> > > So I'm afraid we'll have to find some other way to synchronize
> > > page faults and truncate / punch hole in DAX.
> >
> > What if we don't? If we hit the race (which is vanishingly unlikely with
> > real applications), the consequence is simply that after a truncate, a
> > file may be left with one or two blocks allocated somewhere after i_size.
> > As I understand it, that's not a real problem; they're temporarily
> > unavailable for allocation but will be freed on file removal or the next
> > truncation of that file.
> You mean if you won't have any locking between page fault and truncate?
> You can have:
> a) extending truncate making forgotten blocks with non-zeros visible
> b) filesystem corruption due to doubly used blocks (block will be freed
> from the truncated file and thus can be reallocated but it will still be
> accessible via mmap from the truncated file).
>
> So not a good idea.

Not *no* locking ... just no locking around get_block, like in v7.
So check i_size, call get_block, lock i_mmap_mutex, re-check i_size,
insert mapping if i_size is OK, drop i_mmap_mutex.
As long as get_block()
has enough locking of its own against set_size and concurrent calls
to get_block(), I don't think we can get visible non-zeroes or double
allocation.

> > I'm also still considering the possibility of having truncate-down block
> > until all mmaps that extend after the new i_size have been removed ...
> Hum, I'm not sure how you would do that with current locking scheme and
> wait for all page faults on that range to finish but maybe you have some
> good idea :)

While it can be blocked with i_dio_count currently, this would be a more
complicated thing to do ...

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754101AbaHKPZI (ORCPT ); Mon, 11 Aug 2014 11:25:08 -0400
Received: from cantor2.suse.de ([195.135.220.15]:43645 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753549AbaHKPZF (ORCPT ); Mon, 11 Aug 2014 11:25:05 -0400
Date: Mon, 11 Aug 2014 17:25:01 +0200
From: Jan Kara
To: Matthew Wilcox
Cc: Jan Kara, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler
Message-ID: <20140811152501.GA12279@quack.suse.cz>
References: <20140409214331.GQ32103@quack.suse.cz> <20140729121259.GL6754@linux.intel.com> <20140729210457.GA17807@quack.suse.cz> <20140729212333.GO6754@linux.intel.com> <20140730095229.GA19205@quack.suse.cz> <20140809110000.GA32313@linux.intel.com> <20140811085147.GB29526@quack.suse.cz> <20140811141308.GZ6754@linux.intel.com> <20140811143500.GF29526@quack.suse.cz> <20140811150205.GA6754@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140811150205.GA6754@linux.intel.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 11-08-14
11:02:05, Matthew Wilcox wrote:
> On Mon, Aug 11, 2014 at 04:35:00PM +0200, Jan Kara wrote:
> > On Mon 11-08-14 10:13:08, Matthew Wilcox wrote:
> > > On Mon, Aug 11, 2014 at 10:51:47AM +0200, Jan Kara wrote:
> > > > So I'm afraid we'll have to find some other way to synchronize
> > > > page faults and truncate / punch hole in DAX.
> > >
> > > What if we don't? If we hit the race (which is vanishingly unlikely with
> > > real applications), the consequence is simply that after a truncate, a
> > > file may be left with one or two blocks allocated somewhere after i_size.
> > > As I understand it, that's not a real problem; they're temporarily
> > > unavailable for allocation but will be freed on file removal or the next
> > > truncation of that file.
> > You mean if you won't have any locking between page fault and truncate?
> > You can have:
> > a) extending truncate making forgotten blocks with non-zeros visible
> > b) filesystem corruption due to doubly used blocks (block will be freed
> > from the truncated file and thus can be reallocated but it will still be
> > accessible via mmap from the truncated file).
> >
> > So not a good idea.
>
> Not *no* locking ... just no locking around get_block, like in v7.
> So check i_size, call get_block, lock i_mmap_mutex, re-check i_size,
> insert mapping if i_size is OK, drop i_mmap_mutex. As long as get_block()
> has enough locking of its own against set_size and concurrent calls
> to get_block(), I don't think we can get visible non-zeroes or double
> allocation.

Ah, right. Now I remember. Yes, that solution will only occasionally leave
allocated blocks beyond EOF. That may be acceptable especially if we mark
the file with some flag and truncate those blocks after file is closed in
ext4_release_file().

Honza
--
Jan Kara
SUSE Labs, CR