From: Jan Kara <jack@suse.cz>
To: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
Matthew Wilcox <willy@linux.intel.com>,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 2/2] dax: Giant hack
Date: Thu, 28 Jan 2016 14:10:59 +0100
Message-ID: <20160128131059.GH7726@quack.suse.cz>
In-Reply-To: <1453867708-3999-3-git-send-email-matthew.r.wilcox@intel.com>
On Tue 26-01-16 23:08:28, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
>
> This glop of impossible-to-review code implements a number of ideas that
> need to be separated out.
>
> - Eliminate vm_ops->huge_fault. The core calls ->fault instead and callers
> who set VM_HUGEPAGE should be prepared to deal with FAULT_FLAG_SIZE_PMD
> (and larger)
> - Switch back to calling ->page_mkwrite instead of ->pfn_mkwrite. DAX now
> always has a page to lock, and no other implementations of ->pfn_mkwrite
> exist.
> - dax_mkwrite splits out from dax_fault. dax_fault will now never call
> get_block() to allocate a block; only to see if a block has been allocated.
> dax_mkwrite will always attempt to allocate a block.
> - Filesystems now take their DAX allocation mutex in exclusive/write mode
> when calling dax_mkwrite.
> - Split out dax_insert_pmd_mapping() from dax_pmd_fault and share it with
> the new dax_pmd_mkwrite
> - Change dax_pmd_write to take a vm_fault argument like the rest of the
> family of functions.
I didn't check this in great detail, but it looks sound to me, and I like how
the DAX path is now very similar to the non-DAX one and how PMD and PTE faults
are symmetric. Two things I've noticed:
1) There's no need for the freeze protection & file_update_time() calls in the
fault handler since we don't do any write operations there anymore. It is
enough to do this in the page_mkwrite() handlers (see the sketch below).
2) Nobody uses the complete_unwritten handlers anymore, so you can remove
them as a cleanup (second sketch below).
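
For 1), the ->fault handlers could then shrink to just taking the filesystem's
DAX allocation lock. E.g. ext2_dax_fault() could end up looking roughly like
this (untested sketch only, reusing the names from this patch):

static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct ext2_inode_info *ei = EXT2_I(file_inode(vma->vm_file));
	int ret;

	/*
	 * No sb_start_pagefault() / file_update_time() here: the fault
	 * path no longer writes to the filesystem, so only the
	 * ->page_mkwrite handler needs freeze protection and the
	 * timestamp update.
	 */
	down_read(&ei->dax_sem);
	ret = dax_fault(vma, vmf, ext2_get_block, NULL);
	up_read(&ei->dax_sem);

	return ret;
}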
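
And for 2), once complete_unwritten is gone the DAX entry points could drop
the dax_iodone_t argument completely; just a sketch of what the prototypes in
include/linux/dax.h might become:

/* dax_iodone_t / complete_unwritten no longer needed by any caller */
int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
		get_block_t get_block);
int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
		get_block_t get_block);
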
Honza
>
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
> Documentation/filesystems/Locking | 8 -
> Documentation/filesystems/dax.txt | 5 +-
> fs/block_dev.c | 10 +-
> fs/dax.c | 433 +++++++++++++++++++++++++-------------
> fs/ext2/file.c | 35 +--
> fs/ext4/file.c | 96 +++------
> fs/xfs/xfs_file.c | 95 ++-------
> fs/xfs/xfs_trace.h | 2 -
> include/linux/dax.h | 4 +-
> include/linux/mm.h | 4 -
> mm/memory.c | 51 +----
> mm/mmap.c | 2 +-
> 12 files changed, 359 insertions(+), 386 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 619af9b..1be09e7 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -522,7 +522,6 @@ prototypes:
> void (*close)(struct vm_area_struct*);
> int (*fault)(struct vm_area_struct*, struct vm_fault *);
> int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
> - int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
> int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
>
> locking rules:
> @@ -532,7 +531,6 @@ close: yes
> fault: yes can return with page locked
> map_pages: yes
> page_mkwrite: yes can return with page locked
> -pfn_mkwrite: yes
> access: yes
>
> ->fault() is called when a previously not present pte is about
> @@ -559,12 +557,6 @@ the page has been truncated, the filesystem should not look up a new page
> like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
> will cause the VM to retry the fault.
>
> - ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> -VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> -VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
> -after this call is to make the pte read-write, unless pfn_mkwrite returns
> -an error.
> -
> ->access() is called when get_user_pages() fails in
> access_process_vm(), typically used to debug a process through
> /proc/pid/mem or ptrace. This function is needed only for
> diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
> index 2fe9e74..ff62feb 100644
> --- a/Documentation/filesystems/dax.txt
> +++ b/Documentation/filesystems/dax.txt
> @@ -62,9 +62,8 @@ Filesystem support consists of
> dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
> - implementing an mmap file operation for DAX files which sets the
> VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
> - include handlers for fault, huge_fault and page_mkwrite (which should
> - probably call dax_fault() and dax_mkwrite(), passing the appropriate
> - get_block() callback)
> + include handlers for fault and page_mkwrite (which should probably call
> + dax_fault() and dax_mkwrite(), passing the appropriate get_block() callback)
> - calling dax_truncate_page() instead of block_truncate_page() for DAX files
> - calling dax_zero_page_range() instead of zero_user() for DAX files
> - ensuring that there is sufficient locking between reads, writes,
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index a9474ac..78697fe 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1722,7 +1722,7 @@ static const struct address_space_operations def_blk_aops = {
> *
> * Finally, unlike the filemap_page_mkwrite() case there is no
> * filesystem superblock to sync against freezing. We still include a
> - * pfn_mkwrite callback for dax drivers to receive write fault
> + * page_mkwrite callback for dax drivers to receive write fault
> * notifications.
> */
> static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> @@ -1730,6 +1730,11 @@ static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> return dax_fault(vma, vmf, blkdev_get_block, NULL);
> }
>
> +static int blkdev_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + return dax_mkwrite(vma, vmf, blkdev_get_block, NULL);
> +}
> +
> static void blkdev_vm_open(struct vm_area_struct *vma)
> {
> struct inode *bd_inode = bdev_file_inode(vma->vm_file);
> @@ -1754,8 +1759,7 @@ static const struct vm_operations_struct blkdev_dax_vm_ops = {
> .open = blkdev_vm_open,
> .close = blkdev_vm_close,
> .fault = blkdev_dax_fault,
> - .huge_fault = blkdev_dax_fault,
> - .pfn_mkwrite = blkdev_dax_fault,
> + .page_mkwrite = blkdev_dax_mkwrite,
> };
>
> static const struct vm_operations_struct blkdev_default_vm_ops = {
> diff --git a/fs/dax.c b/fs/dax.c
> index dbaf62c..952c2c2 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -372,7 +372,7 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
>
> if (sector == NO_SECTOR) {
> /*
> - * This can happen during correct operation if our pfn_mkwrite
> + * This can happen during correct operation if our page_mkwrite
> * fault raced against a hole punch operation. If this
> * happens the pte that was hole punched will have been
> * unmapped and the radix tree entry will have been removed by
> @@ -584,7 +584,6 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> sector_t block;
> pgoff_t size;
> int error;
> - int major = 0;
>
> size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> if (vmf->pgoff >= size)
> @@ -624,20 +623,8 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> if (error)
> goto unlock_page;
>
> - if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
> - if (vmf->flags & FAULT_FLAG_WRITE) {
> - error = get_block(inode, block, &bh, 1);
> - count_vm_event(PGMAJFAULT);
> - mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> - major = VM_FAULT_MAJOR;
> - if (!error && (bh.b_size < PAGE_SIZE))
> - error = -EIO;
> - if (error)
> - goto unlock_page;
> - } else {
> - return dax_load_hole(mapping, page, vmf);
> - }
> - }
> + if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> + return dax_load_hole(mapping, page, vmf);
>
> if (vmf->cow_page) {
> struct page *new_page = vmf->cow_page;
> @@ -655,16 +642,101 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> PAGE_SHIFT;
> if (vmf->pgoff >= size) {
> i_mmap_unlock_read(mapping);
> - error = -EIO;
> - goto out;
> + return VM_FAULT_SIGBUS;
> }
> }
> return VM_FAULT_LOCKED;
> }
>
> - /* Check we didn't race with a read fault installing a new page */
> - if (!page && major)
> - page = find_lock_page(mapping, vmf->pgoff);
> + /*
> + * If we successfully insert a mapping to an unwritten extent, we
> + * need to convert the unwritten extent. If there is an error
> + * inserting the mapping, the filesystem needs to leave it as
> + * unwritten to prevent exposure of the stale underlying data to
> + * userspace, but we still need to call the completion function so
> + * the private resources on the mapping buffer can be released. We
> + * indicate what the callback should do via the uptodate variable,
> + * same as for normal BH based IO completions.
> + */
> + error = dax_insert_mapping(inode, &bh, vma, vmf);
> + if (buffer_unwritten(&bh)) {
> + if (complete_unwritten)
> + complete_unwritten(&bh, !error);
> + else
> + WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
> + }
> +
> + out:
> + if (error == -ENOMEM)
> + return VM_FAULT_OOM;
> + /* -EBUSY is fine, somebody else faulted on the same PTE */
> + if ((error < 0) && (error != -EBUSY))
> + return VM_FAULT_SIGBUS;
> + return VM_FAULT_NOPAGE;
> +
> + unlock_page:
> + if (page) {
> + unlock_page(page);
> + page_cache_release(page);
> + }
> + goto out;
> +}
> +
> +static int dax_pte_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block, dax_iodone_t complete_unwritten)
> +{
> + struct file *file = vma->vm_file;
> + struct address_space *mapping = file->f_mapping;
> + struct inode *inode = mapping->host;
> + struct page *page;
> + struct buffer_head bh;
> + unsigned blkbits = inode->i_blkbits;
> + sector_t block;
> + pgoff_t size;
> + int error;
> + int major = 0;
> +
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (vmf->pgoff >= size)
> + return VM_FAULT_SIGBUS;
> +
> + memset(&bh, 0, sizeof(bh));
> + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> + bh.b_bdev = inode->i_sb->s_bdev;
> + bh.b_size = PAGE_SIZE;
> +
> + repeat:
> + page = find_get_page(mapping, vmf->pgoff);
> + if (page) {
> + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
> + page_cache_release(page);
> + return VM_FAULT_RETRY;
> + }
> + if (unlikely(page->mapping != mapping)) {
> + unlock_page(page);
> + page_cache_release(page);
> + goto repeat;
> + }
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (unlikely(vmf->pgoff >= size)) {
> + /*
> + * We have a struct page covering a hole in the file
> + * from a read fault and we've raced with a truncate
> + */
> + error = -EIO;
> + goto unlock_page;
> + }
> + }
> +
> + error = get_block(inode, block, &bh, 1);
> + if (!error && (bh.b_size < PAGE_SIZE))
> + error = -EIO; /* fs corruption? */
> + if (error)
> + goto unlock_page;
> +
> + count_vm_event(PGMAJFAULT);
> + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> + major = VM_FAULT_MAJOR;
>
> if (page) {
> unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> @@ -675,16 +747,6 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> page = NULL;
> }
>
> - /*
> - * If we successfully insert the new mapping over an unwritten extent,
> - * we need to ensure we convert the unwritten extent. If there is an
> - * error inserting the mapping, the filesystem needs to leave it as
> - * unwritten to prevent exposure of the stale underlying data to
> - * userspace, but we still need to call the completion function so
> - * the private resources on the mapping buffer can be released. We
> - * indicate what the callback should do via the uptodate variable, same
> - * as for normal BH based IO completions.
> - */
> error = dax_insert_mapping(inode, &bh, vma, vmf);
> if (buffer_unwritten(&bh)) {
> if (complete_unwritten)
> @@ -734,22 +796,101 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
>
> #define dax_pmd_dbg(bh, address, reason) __dax_dbg(bh, address, reason, "dax_pmd")
>
> -static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> - pmd_t *pmd, unsigned int flags, get_block_t get_block,
> - dax_iodone_t complete_unwritten)
> +static int dax_insert_pmd_mapping(struct inode *inode, struct buffer_head *bh,
> + struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct blk_dax_ctl dax = {
> + .sector = to_sector(bh, inode),
> + .size = PMD_SIZE,
> + };
> + struct block_device *bdev = bh->b_bdev;
> + bool write = vmf->flags & FAULT_FLAG_WRITE;
> + unsigned long address = (unsigned long)vmf->virtual_address;
> + pgoff_t pgoff = linear_page_index(vma, address & PMD_MASK);
> + int major;
> + long length;
> +
> + length = dax_map_atomic(bdev, &dax);
> + if (length < 0)
> + return VM_FAULT_SIGBUS;
> + if (length < PMD_SIZE) {
> + dax_pmd_dbg(bh, address, "dax-length too small");
> + goto unmap;
> + }
> +
> + if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
> + dax_pmd_dbg(bh, address, "pfn unaligned");
> + goto unmap;
> + }
> +
> + if (!pfn_t_devmap(dax.pfn)) {
> + dax_pmd_dbg(bh, address, "pfn not in memmap");
> + goto unmap;
> + }
> +
> + if (buffer_unwritten(bh) || buffer_new(bh)) {
> + clear_pmem(dax.addr, PMD_SIZE);
> + wmb_pmem();
> + count_vm_event(PGMAJFAULT);
> + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> + major = VM_FAULT_MAJOR;
> + } else
> + major = 0;
> +
> + dax_unmap_atomic(bdev, &dax);
> +
> + /*
> + * For PTE faults we insert a radix tree entry for reads, and
> + * leave it clean. Then on the first write we dirty the radix
> + * tree entry via the dax_mkwrite() path. This sequence
> + * allows the dax_mkwrite() call to be simpler and avoid a
> + * call into get_block() to translate the pgoff to a sector in
> + * order to be able to create a new radix tree entry.
> + *
> + * The PMD path doesn't have an equivalent to
> + * dax_mkwrite(), though, so for a read followed by a
> + * write we traverse all the way through __dax_pmd_fault()
> + * twice. This means we can just skip inserting a radix tree
> + * entry completely on the initial read and just wait until
> + * the write to insert a dirty entry.
> + */
> + if (write) {
> + int error = dax_radix_entry(vma->vm_file->f_mapping, pgoff,
> + dax.sector, true, true);
> + if (error) {
> + dax_pmd_dbg(bh, address, "PMD radix insertion failed");
> + goto fallback;
> + }
> + }
> +
> + dev_dbg(part_to_dev(bdev->bd_part),
> + "%s: %s addr: %lx pfn: %lx sect: %llx\n",
> + __func__, current->comm, address,
> + pfn_t_to_pfn(dax.pfn),
> + (unsigned long long) dax.sector);
> + return major | vmf_insert_pfn_pmd(vma, address, vmf->pmd, dax.pfn,
> + write);
> +
> + unmap:
> + dax_unmap_atomic(bdev, &dax);
> + fallback:
> + return VM_FAULT_FALLBACK;
> +}
> +
> +static int dax_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block, dax_iodone_t complete_unwritten)
> {
> struct file *file = vma->vm_file;
> struct address_space *mapping = file->f_mapping;
> struct inode *inode = mapping->host;
> struct buffer_head bh;
> unsigned blkbits = inode->i_blkbits;
> + unsigned long address = (unsigned long)vmf->virtual_address;
> unsigned long pmd_addr = address & PMD_MASK;
> - bool write = flags & FAULT_FLAG_WRITE;
> - struct block_device *bdev;
> + bool write = vmf->flags & FAULT_FLAG_WRITE;
> pgoff_t size, pgoff;
> sector_t block;
> - int error, result = 0;
> - bool alloc = false;
> + int result = 0;
>
> /* dax pmd mappings require pfn_t_devmap() */
> if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> @@ -757,7 +898,7 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>
> /* Fall back to PTEs if we're going to COW */
> if (write && !(vma->vm_flags & VM_SHARED)) {
> - split_huge_pmd(vma, pmd, address);
> + split_huge_pmd(vma, vmf->pmd, address);
> dax_pmd_dbg(NULL, address, "cow write");
> return VM_FAULT_FALLBACK;
> }
> @@ -791,14 +932,6 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> if (get_block(inode, block, &bh, 0) != 0)
> return VM_FAULT_SIGBUS;
>
> - if (!buffer_mapped(&bh) && write) {
> - if (get_block(inode, block, &bh, 1) != 0)
> - return VM_FAULT_SIGBUS;
> - alloc = true;
> - }
> -
> - bdev = bh.b_bdev;
> -
> /*
> * If the filesystem isn't willing to tell us the length of a hole,
> * just fall back to PTEs. Calling get_block 512 times in a loop
> @@ -809,17 +942,6 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> return VM_FAULT_FALLBACK;
> }
>
> - /*
> - * If we allocated new storage, make sure no process has any
> - * zero pages covering this hole
> - */
> - if (alloc) {
> - loff_t lstart = pgoff << PAGE_SHIFT;
> - loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
> -
> - truncate_pagecache_range(inode, lstart, lend);
> - }
> -
> i_mmap_lock_read(mapping);
>
> /*
> @@ -839,9 +961,9 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> goto fallback;
> }
>
> - if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
> + if (!buffer_mapped(&bh) && buffer_uptodate(&bh)) {
> spinlock_t *ptl;
> - pmd_t entry;
> + pmd_t entry, *pmd = vmf->pmd;
> struct page *zero_page = get_huge_zero_page();
>
> if (unlikely(!zero_page)) {
> @@ -856,7 +978,7 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> goto fallback;
> }
>
> - dev_dbg(part_to_dev(bdev->bd_part),
> + dev_dbg(part_to_dev(bh.b_bdev->bd_part),
> "%s: %s addr: %lx pfn: <zero> sect: %llx\n",
> __func__, current->comm, address,
> (unsigned long long) to_sector(&bh, inode));
> @@ -867,75 +989,90 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> result = VM_FAULT_NOPAGE;
> spin_unlock(ptl);
> } else {
> - struct blk_dax_ctl dax = {
> - .sector = to_sector(&bh, inode),
> - .size = PMD_SIZE,
> - };
> - long length = dax_map_atomic(bdev, &dax);
> + result |= dax_insert_pmd_mapping(inode, &bh, vma, vmf);
> + }
>
> - if (length < 0) {
> - result = VM_FAULT_SIGBUS;
> - goto out;
> - }
> - if (length < PMD_SIZE) {
> - dax_pmd_dbg(&bh, address, "dax-length too small");
> - dax_unmap_atomic(bdev, &dax);
> - goto fallback;
> - }
> - if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
> - dax_pmd_dbg(&bh, address, "pfn unaligned");
> - dax_unmap_atomic(bdev, &dax);
> - goto fallback;
> - }
> + out:
> + i_mmap_unlock_read(mapping);
>
> - if (!pfn_t_devmap(dax.pfn)) {
> - dax_unmap_atomic(bdev, &dax);
> - dax_pmd_dbg(&bh, address, "pfn not in memmap");
> - goto fallback;
> - }
> + if (buffer_unwritten(&bh))
> + complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
>
> - if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> - clear_pmem(dax.addr, PMD_SIZE);
> - wmb_pmem();
> - count_vm_event(PGMAJFAULT);
> - mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> - result |= VM_FAULT_MAJOR;
> - }
> - dax_unmap_atomic(bdev, &dax);
> + return result;
>
> - /*
> - * For PTE faults we insert a radix tree entry for reads, and
> - * leave it clean. Then on the first write we dirty the radix
> - * tree entry via the dax_pfn_mkwrite() path. This sequence
> - * allows the dax_pfn_mkwrite() call to be simpler and avoid a
> - * call into get_block() to translate the pgoff to a sector in
> - * order to be able to create a new radix tree entry.
> - *
> - * The PMD path doesn't have an equivalent to
> - * dax_pfn_mkwrite(), though, so for a read followed by a
> - * write we traverse all the way through __dax_pmd_fault()
> - * twice. This means we can just skip inserting a radix tree
> - * entry completely on the initial read and just wait until
> - * the write to insert a dirty entry.
> - */
> - if (write) {
> - error = dax_radix_entry(mapping, pgoff, dax.sector,
> - true, true);
> - if (error) {
> - dax_pmd_dbg(&bh, address,
> - "PMD radix insertion failed");
> - goto fallback;
> - }
> - }
> + fallback:
> + count_vm_event(THP_FAULT_FALLBACK);
> + result = VM_FAULT_FALLBACK;
> + goto out;
> +}
>
> - dev_dbg(part_to_dev(bdev->bd_part),
> - "%s: %s addr: %lx pfn: %lx sect: %llx\n",
> - __func__, current->comm, address,
> - pfn_t_to_pfn(dax.pfn),
> - (unsigned long long) dax.sector);
> - result |= vmf_insert_pfn_pmd(vma, address, pmd,
> - dax.pfn, write);
> +static int dax_pmd_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block, dax_iodone_t complete_unwritten)
> +{
> + struct file *file = vma->vm_file;
> + struct address_space *mapping = file->f_mapping;
> + struct inode *inode = mapping->host;
> + unsigned long address = (unsigned long)vmf->virtual_address;
> + struct buffer_head bh;
> + unsigned blkbits = inode->i_blkbits;
> + unsigned long pmd_addr = address & PMD_MASK;
> + struct block_device *bdev;
> + pgoff_t size, pgoff;
> + loff_t lstart, lend;
> + sector_t block;
> + int result = 0;
> +
> + pgoff = linear_page_index(vma, pmd_addr);
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (pgoff >= size)
> + return VM_FAULT_SIGBUS;
> +
> + memset(&bh, 0, sizeof(bh));
> + bh.b_bdev = inode->i_sb->s_bdev;
> + block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
> +
> + bh.b_size = PMD_SIZE;
> +
> + if (get_block(inode, block, &bh, 1) != 0)
> + return VM_FAULT_SIGBUS;
> +
> + bdev = bh.b_bdev;
> +
> + /*
> + * If the filesystem isn't willing to tell us the length of a hole,
> + * just fall back to PTEs. Calling get_block 512 times in a loop
> + * would be silly.
> + */
> + if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
> + dax_pmd_dbg(&bh, address, "allocated block too small");
> + return VM_FAULT_FALLBACK;
> + }
> +
> + /* Make sure no process has any zero pages covering this hole */
> + lstart = pgoff << PAGE_SHIFT;
> + lend = lstart + PMD_SIZE - 1; /* inclusive */
> + truncate_pagecache_range(inode, lstart, lend);
> +
> + i_mmap_lock_read(mapping);
> +
> + /*
> + * If a truncate happened while we were allocating blocks, we may
> + * leave blocks allocated to the file that are beyond EOF. We can't
> + * take i_mutex here, so just leave them hanging; they'll be freed
> + * when the file is deleted.
> + */
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (pgoff >= size) {
> + result = VM_FAULT_SIGBUS;
> + goto out;
> }
> + if ((pgoff | PG_PMD_COLOUR) >= size) {
> + dax_pmd_dbg(&bh, address,
> + "offset + huge page size > file size");
> + goto fallback;
> + }
> +
> + result |= dax_insert_pmd_mapping(inode, &bh, vma, vmf);
>
> out:
> i_mmap_unlock_read(mapping);
> @@ -951,9 +1088,13 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> goto out;
> }
> #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> -static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> - pmd_t *pmd, unsigned int flags, get_block_t get_block,
> - dax_iodone_t complete_unwritten)
> +static int dax_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block, dax_iodone_t complete_unwritten)
> +{
> + return VM_FAULT_FALLBACK;
> +}
> +static int dax_pmd_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block, dax_iodone_t complete_unwritten)
> {
> return VM_FAULT_FALLBACK;
> }
> @@ -978,13 +1119,11 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> get_block_t get_block, dax_iodone_t iodone)
> {
> - unsigned long address = (unsigned long)vmf->virtual_address;
> switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
> case FAULT_FLAG_SIZE_PTE:
> return dax_pte_fault(vma, vmf, get_block, iodone);
> case FAULT_FLAG_SIZE_PMD:
> - return dax_pmd_fault(vma, address, vmf->pmd, vmf->flags,
> - get_block, iodone);
> + return dax_pmd_fault(vma, vmf, get_block, iodone);
> default:
> return VM_FAULT_FALLBACK;
> }
> @@ -992,26 +1131,30 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> EXPORT_SYMBOL_GPL(dax_fault);
>
> /**
> - * dax_pfn_mkwrite - handle first write to DAX page
> + * dax_mkwrite - handle first write to a DAX page
> * @vma: The virtual memory area where the fault occurred
> * @vmf: The description of the fault
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + * @iodone: The filesystem method used to convert unwritten blocks
> + * to written so the data written to them is exposed. This is required
> + * by write faults for filesystems that will return unwritten extent
> + * mappings from @get_block, but it is optional for reads as
> + * dax_insert_mapping() will always zero unwritten blocks. If the fs
> + * does not support unwritten extents, then it should pass NULL.
> */
> -int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block, dax_iodone_t iodone)
> {
> - struct file *file = vma->vm_file;
> -
> - /*
> - * We pass NO_SECTOR to dax_radix_entry() because we expect that a
> - * RADIX_DAX_PTE entry already exists in the radix tree from a
> - * previous call to __dax_fault(). We just want to look up that PTE
> - * entry using vmf->pgoff and make sure the dirty tag is set. This
> - * saves us from having to make a call to get_block() here to look
> - * up the sector.
> - */
> - dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> - return VM_FAULT_NOPAGE;
> + switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
> + case FAULT_FLAG_SIZE_PTE:
> + return dax_pte_mkwrite(vma, vmf, get_block, iodone);
> + case FAULT_FLAG_SIZE_PMD:
> + return dax_pmd_mkwrite(vma, vmf, get_block, iodone);
> + default:
> + return VM_FAULT_FALLBACK;
> + }
> }
> -EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> +EXPORT_SYMBOL_GPL(dax_mkwrite);
>
> /**
> * dax_zero_page_range - zero a range within a page of a DAX file
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index cf6f78c..6028c63 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -49,13 +49,14 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> sb_start_pagefault(inode->i_sb);
> file_update_time(vma->vm_file);
> }
> - down_read(&ei->dax_sem);
>
> + down_read(&ei->dax_sem);
> ret = dax_fault(vma, vmf, ext2_get_block, NULL);
> -
> up_read(&ei->dax_sem);
> +
> if (vmf->flags & FAULT_FLAG_WRITE)
> sb_end_pagefault(inode->i_sb);
> +
> return ret;
> }
>
> @@ -67,44 +68,18 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>
> sb_start_pagefault(inode->i_sb);
> file_update_time(vma->vm_file);
> - down_read(&ei->dax_sem);
>
> + down_write(&ei->dax_sem);
> ret = dax_mkwrite(vma, vmf, ext2_get_block, NULL);
> + up_write(&ei->dax_sem);
>
> - up_read(&ei->dax_sem);
> - sb_end_pagefault(inode->i_sb);
> - return ret;
> -}
> -
> -static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
> - struct vm_fault *vmf)
> -{
> - struct inode *inode = file_inode(vma->vm_file);
> - struct ext2_inode_info *ei = EXT2_I(inode);
> - loff_t size;
> - int ret;
> -
> - sb_start_pagefault(inode->i_sb);
> - file_update_time(vma->vm_file);
> - down_read(&ei->dax_sem);
> -
> - /* check that the faulting page hasn't raced with truncate */
> - size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - if (vmf->pgoff >= size)
> - ret = VM_FAULT_SIGBUS;
> - else
> - ret = dax_pfn_mkwrite(vma, vmf);
> -
> - up_read(&ei->dax_sem);
> sb_end_pagefault(inode->i_sb);
> return ret;
> }
>
> static const struct vm_operations_struct ext2_dax_vm_ops = {
> .fault = ext2_dax_fault,
> - .huge_fault = ext2_dax_fault,
> .page_mkwrite = ext2_dax_mkwrite,
> - .pfn_mkwrite = ext2_dax_pfn_mkwrite,
> };
>
> static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 71859ed..72dcece 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -196,99 +196,65 @@ out:
> static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> int result;
> - handle_t *handle = NULL;
> struct inode *inode = file_inode(vma->vm_file);
> struct super_block *sb = inode->i_sb;
> bool write = vmf->flags & FAULT_FLAG_WRITE;
>
> if (write) {
> - unsigned nblocks;
> - switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
> - case FAULT_FLAG_SIZE_PTE:
> - nblocks = EXT4_DATA_TRANS_BLOCKS(sb);
> - break;
> - case FAULT_FLAG_SIZE_PMD:
> - nblocks = ext4_chunk_trans_blocks(inode,
> - PMD_SIZE / PAGE_SIZE);
> - break;
> - default:
> - return VM_FAULT_FALLBACK;
> - }
> -
> sb_start_pagefault(sb);
> file_update_time(vma->vm_file);
> - down_read(&EXT4_I(inode)->i_mmap_sem);
> - handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, nblocks);
> - } else
> - down_read(&EXT4_I(inode)->i_mmap_sem);
> + }
>
> - if (IS_ERR(handle))
> - result = VM_FAULT_SIGBUS;
> - else
> - result = dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
> + down_read(&EXT4_I(inode)->i_mmap_sem);
> + result = dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
> + up_read(&EXT4_I(inode)->i_mmap_sem);
>
> - if (write) {
> - if (!IS_ERR(handle))
> - ext4_journal_stop(handle);
> - up_read(&EXT4_I(inode)->i_mmap_sem);
> + if (write)
> sb_end_pagefault(sb);
> - } else
> - up_read(&EXT4_I(inode)->i_mmap_sem);
>
> return result;
> }
>
> static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> - int err;
> + int result;
> struct inode *inode = file_inode(vma->vm_file);
> + struct super_block *sb = inode->i_sb;
> + handle_t *handle;
> + unsigned nblocks;
> +
> + switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
> + case FAULT_FLAG_SIZE_PTE:
> + nblocks = EXT4_DATA_TRANS_BLOCKS(sb);
> + break;
> + case FAULT_FLAG_SIZE_PMD:
> + nblocks = ext4_chunk_trans_blocks(inode, PMD_SIZE / PAGE_SIZE);
> + break;
> + default:
> + return VM_FAULT_FALLBACK;
> + }
>
> sb_start_pagefault(inode->i_sb);
> file_update_time(vma->vm_file);
> - down_read(&EXT4_I(inode)->i_mmap_sem);
> - err = dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
> - up_read(&EXT4_I(inode)->i_mmap_sem);
> - sb_end_pagefault(inode->i_sb);
> -
> - return err;
> -}
>
> -/*
> - * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
> - * handler we check for races agaist truncate. Note that since we cycle through
> - * i_mmap_sem, we are sure that also any hole punching that began before we
> - * were called is finished by now and so if it included part of the file we
> - * are working on, our pte will get unmapped and the check for pte_same() in
> - * wp_pfn_shared() fails. Thus fault gets retried and things work out as
> - * desired.
> - */
> -static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
> - struct vm_fault *vmf)
> -{
> - struct inode *inode = file_inode(vma->vm_file);
> - struct super_block *sb = inode->i_sb;
> - loff_t size;
> - int ret;
> + handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, nblocks);
> + if (IS_ERR(handle)) {
> + result = VM_FAULT_SIGBUS;
> + } else {
> + down_write(&EXT4_I(inode)->i_mmap_sem);
> + result = dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
> + up_write(&EXT4_I(inode)->i_mmap_sem);
> + ext4_journal_stop(handle);
> + }
>
> - sb_start_pagefault(sb);
> - file_update_time(vma->vm_file);
> - down_read(&EXT4_I(inode)->i_mmap_sem);
> - size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - if (vmf->pgoff >= size)
> - ret = VM_FAULT_SIGBUS;
> - else
> - ret = dax_pfn_mkwrite(vma, vmf);
> - up_read(&EXT4_I(inode)->i_mmap_sem);
> - sb_end_pagefault(sb);
> + sb_end_pagefault(inode->i_sb);
>
> - return ret;
> + return result;
> }
>
> static const struct vm_operations_struct ext4_dax_vm_ops = {
> .fault = ext4_dax_fault,
> - .huge_fault = ext4_dax_fault,
> .page_mkwrite = ext4_dax_mkwrite,
> - .pfn_mkwrite = ext4_dax_pfn_mkwrite,
> };
> #else
> #define ext4_dax_vm_ops ext4_file_vm_ops
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 6db703b..f51f09a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1517,22 +1517,25 @@ xfs_filemap_page_mkwrite(
> struct vm_fault *vmf)
> {
> struct inode *inode = file_inode(vma->vm_file);
> + struct xfs_inode *ip = XFS_I(inode);
> int ret;
>
> - trace_xfs_filemap_page_mkwrite(XFS_I(inode));
> + trace_xfs_filemap_page_mkwrite(ip);
>
> sb_start_pagefault(inode->i_sb);
> file_update_time(vma->vm_file);
> - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>
> if (IS_DAX(inode)) {
> + xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> ret = dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> } else {
> + xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
> ret = block_page_mkwrite_return(ret);
> + xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> }
>
> - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> sb_end_pagefault(inode->i_sb);
>
> return ret;
> @@ -1544,15 +1547,17 @@ xfs_filemap_fault(
> struct vm_fault *vmf)
> {
> struct inode *inode = file_inode(vma->vm_file);
> + struct xfs_inode *ip = XFS_I(inode);
> int ret;
>
> - trace_xfs_filemap_fault(XFS_I(inode));
> + trace_xfs_filemap_fault(ip);
>
> - /* DAX can shortcut the normal fault path on write faults! */
> - if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(inode))
> - return xfs_filemap_page_mkwrite(vma, vmf);
> + if (IS_DAX(inode) && vmf->flags & FAULT_FLAG_WRITE) {
> + sb_start_pagefault(inode->i_sb);
> + file_update_time(vma->vm_file);
> + }
>
> - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> + xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> if (IS_DAX(inode)) {
> /*
> * we do not want to trigger unwritten extent conversion on read
> @@ -1563,88 +1568,18 @@ xfs_filemap_fault(
> ret = dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> } else
> ret = filemap_fault(vma, vmf);
> - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -
> - return ret;
> -}
> -
> -/*
> - * Similar to xfs_filemap_fault(), the DAX fault path can call into here on
> - * both read and write faults. Hence we need to handle both cases. There is no
> - * ->huge_mkwrite callout for huge pages, so we have a single function here to
> - * handle both cases here. @flags carries the information on the type of fault
> - * occuring.
> - */
> -STATIC int
> -xfs_filemap_huge_fault(
> - struct vm_area_struct *vma,
> - struct vm_fault *vmf)
> -{
> - struct inode *inode = file_inode(vma->vm_file);
> - struct xfs_inode *ip = XFS_I(inode);
> - int ret;
> -
> - if (!IS_DAX(inode))
> - return VM_FAULT_FALLBACK;
> -
> - trace_xfs_filemap_huge_fault(ip);
> -
> - if (vmf->flags & FAULT_FLAG_WRITE) {
> - sb_start_pagefault(inode->i_sb);
> - file_update_time(vma->vm_file);
> - }
> -
> - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> - ret = dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> + xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
>
> - if (vmf->flags & FAULT_FLAG_WRITE)
> + if (IS_DAX(inode) && vmf->flags & FAULT_FLAG_WRITE)
> sb_end_pagefault(inode->i_sb);
>
> return ret;
> }
>
> -/*
> - * pfn_mkwrite was originally intended to ensure we capture time stamp
> - * updates on write faults. In reality, it's need to serialise against
> - * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
> - * to ensure we serialise the fault barrier in place.
> - */
> -static int
> -xfs_filemap_pfn_mkwrite(
> - struct vm_area_struct *vma,
> - struct vm_fault *vmf)
> -{
> -
> - struct inode *inode = file_inode(vma->vm_file);
> - struct xfs_inode *ip = XFS_I(inode);
> - int ret = VM_FAULT_NOPAGE;
> - loff_t size;
> -
> - trace_xfs_filemap_pfn_mkwrite(ip);
> -
> - sb_start_pagefault(inode->i_sb);
> - file_update_time(vma->vm_file);
> -
> - /* check if the faulting page hasn't raced with truncate */
> - xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> - size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - if (vmf->pgoff >= size)
> - ret = VM_FAULT_SIGBUS;
> - else if (IS_DAX(inode))
> - ret = dax_pfn_mkwrite(vma, vmf);
> - xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> - sb_end_pagefault(inode->i_sb);
> - return ret;
> -
> -}
> -
> static const struct vm_operations_struct xfs_file_vm_ops = {
> .fault = xfs_filemap_fault,
> - .huge_fault = xfs_filemap_huge_fault,
> .map_pages = filemap_map_pages,
> .page_mkwrite = xfs_filemap_page_mkwrite,
> - .pfn_mkwrite = xfs_filemap_pfn_mkwrite,
> };
>
> STATIC int
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index fb1f3e1..3f1515f 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -687,9 +687,7 @@ DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
> DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
>
> DEFINE_INODE_EVENT(xfs_filemap_fault);
> -DEFINE_INODE_EVENT(xfs_filemap_huge_fault);
> DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);
> -DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite);
>
> DECLARE_EVENT_CLASS(xfs_iref_class,
> TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip),
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 8e58c36..b9e745f 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -12,8 +12,8 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
> int dax_truncate_page(struct inode *, loff_t from, get_block_t);
> int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> dax_iodone_t);
> -int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
> -#define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod)
> +int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t,
> + dax_iodone_t);
>
> static inline bool vma_is_dax(struct vm_area_struct *vma)
> {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b9d0979..eac1aeb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -281,16 +281,12 @@ struct vm_operations_struct {
> void (*close)(struct vm_area_struct * area);
> int (*mremap)(struct vm_area_struct * area);
> int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
> - int (*huge_fault)(struct vm_area_struct *, struct vm_fault *vmf);
> void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
>
> /* notification that a previously read-only page is about to become
> * writable, if an error is returned it will cause a SIGBUS */
> int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>
> - /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> - int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> -
> /* called by access_process_vm when get_user_pages() fails, typically
> * for use by special VMAs that can switch between memory and hardware
> */
> diff --git a/mm/memory.c b/mm/memory.c
> index 03d49eb..0af34e2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2210,42 +2210,6 @@ oom:
> return VM_FAULT_OOM;
> }
>
> -/*
> - * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
> - * mapping
> - */
> -static int wp_pfn_shared(struct mm_struct *mm,
> - struct vm_area_struct *vma, unsigned long address,
> - pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
> - pmd_t *pmd)
> -{
> - if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> - struct vm_fault vmf = {
> - .page = NULL,
> - .pgoff = linear_page_index(vma, address),
> - .virtual_address = (void __user *)(address & PAGE_MASK),
> - .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> - };
> - int ret;
> -
> - pte_unmap_unlock(page_table, ptl);
> - ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
> - if (ret & VM_FAULT_ERROR)
> - return ret;
> - page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> - /*
> - * We might have raced with another page fault while we
> - * released the pte_offset_map_lock.
> - */
> - if (!pte_same(*page_table, orig_pte)) {
> - pte_unmap_unlock(page_table, ptl);
> - return 0;
> - }
> - }
> - return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
> - NULL, 0, 0);
> -}
> -
> static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, pte_t *page_table,
> pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
> @@ -2324,12 +2288,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> * VM_PFNMAP VMA.
> *
> * We should not cow pages in a shared writeable mapping.
> - * Just mark the pages writable and/or call ops->pfn_mkwrite.
> + * Just mark the pages writable as we can't do any dirty
> + * accounting on raw pfn maps.
> */
> if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> (VM_WRITE|VM_SHARED))
> - return wp_pfn_shared(mm, vma, address, page_table, ptl,
> - orig_pte, pmd);
> + return wp_page_reuse(mm, vma, address, page_table, ptl,
> + orig_pte, old_page, 0, 0);
>
> pte_unmap_unlock(page_table, ptl);
> return wp_page_copy(mm, vma, address, page_table, pmd,
> @@ -3282,8 +3247,8 @@ static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>
> if (vma_is_anonymous(vma))
> return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
> - if (vma->vm_ops->huge_fault)
> - return vma->vm_ops->huge_fault(vma, &vmf);
> + if (vma->vm_ops->fault)
> + return vma->vm_ops->fault(vma, &vmf);
> return VM_FAULT_FALLBACK;
> }
>
> @@ -3299,8 +3264,8 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>
> if (vma_is_anonymous(vma))
> return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
> - if (vma->vm_ops->huge_fault)
> - return vma->vm_ops->huge_fault(vma, &vmf);
> + if (vma->vm_ops->page_mkwrite)
> + return vma->vm_ops->page_mkwrite(vma, &vmf);
> return VM_FAULT_FALLBACK;
> }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 407ab43..0d851cb 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1490,7 +1490,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
> return 0;
>
> /* The backer wishes to know when pages are first written to? */
> - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> + if (vm_ops && vm_ops->page_mkwrite)
> return 1;
>
> /* The open routine did something to the protections that pgprot_modify
> --
> 2.7.0.rc3
>
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR