Re: [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler

From: Matthew Wilcox <willy@linux.intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, willy@linux.intel.com
Subject: Re: [PATCH v12 08/20] dax,ext2: Replace the XIP page fault handler with the DAX page fault handler
Date: Tue, 13 Jan 2015 16:53:34 -0500
Message-ID: <20150113215334.GK5661@wil.cx>
In-Reply-To: <20150112150952.b44ee750a6292284e7a909ff@linux-foundation.org>

On Mon, Jan 12, 2015 at 03:09:52PM -0800, Andrew Morton wrote:
> On Fri, 24 Oct 2014 17:20:40 -0400 Matthew Wilcox <matthew.r.wilcox@intel.com> wrote:
> 
> > Instead of calling aops->get_xip_mem from the fault handler, the
> > filesystem passes a get_block_t that is used to find the appropriate
> > blocks.
> > 
> > ...
> >
> > +static int copy_user_bh(struct page *to, struct buffer_head *bh,
> > +			unsigned blkbits, unsigned long vaddr)
> > +{
> > +	void *vfrom, *vto;
> > +	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
> > +		return -EIO;
> > +	vto = kmap_atomic(to);
> > +	copy_user_page(vto, vfrom, vaddr, to);
> > +	kunmap_atomic(vto);
> 
> Again, please check the cache-flush aspects.  copy_user_page() appears
> to be responsible for handling coherency issues on the destination
> vaddr, but what about *vto?

vto is a new kernel address ... if there's any dirty data for that
address, it should have been flushed by the prior kunmap_atomic(), right?

> > +	mutex_lock(&mapping->i_mmap_mutex);
> > +
> > +	/*
> > +	 * Check truncate didn't happen while we were allocating a block.
> > +	 * If it did, this block may or may not be still allocated to the
> > +	 * file.  We can't tell the filesystem to free it because we can't
> > +	 * take i_mutex here.
> 
> (what's preventing us from taking i_mutex?)

We're in a page fault handler, and we may already be holding i_mutex.
We're definitely holding mmap_sem, and to quote from mm/rmap.c:

/*
 * Lock ordering in mm:
 *
 * inode->i_mutex       (while writing or truncating, not reading or faulting)
 *   mm->mmap_sem
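
The ordering rule quoted above can be modelled in userspace as a rank check. This is only a minimal sketch: the rank enum, struct task_locks, may_acquire() and acquire() are hypothetical names invented for illustration, not kernel interfaces. A lock may only be taken if its rank is strictly higher than anything already held, which is why a fault handler that already holds mmap_sem cannot go back and take i_mutex.

```c
#include <stdbool.h>

/* Hypothetical ranks: a lower rank must be taken before a higher one. */
enum lock_rank {
	RANK_NONE = 0,
	RANK_I_MUTEX = 1,	/* models inode->i_mutex */
	RANK_MMAP_SEM = 2,	/* models mm->mmap_sem */
};

/* Highest-ranked lock the modelled task currently holds. */
struct task_locks {
	enum lock_rank highest_held;
};

/* True if taking 'want' respects the ordering; false if it would
 * invert it or re-take a lock the task may already hold. */
static bool may_acquire(const struct task_locks *t, enum lock_rank want)
{
	return want > t->highest_held;
}

/* Record the acquisition if it is allowed; report whether it was. */
static bool acquire(struct task_locks *t, enum lock_rank want)
{
	if (!may_acquire(t, want))
		return false;
	t->highest_held = want;
	return true;
}
```

In this model a writer that starts with no locks can acquire RANK_I_MUTEX and then RANK_MMAP_SEM, while a faulting task whose highest_held is already RANK_MMAP_SEM is refused RANK_I_MUTEX.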

> >  	   In the worst case, the file still has blocks
> > +	 * allocated past the end of the file.
> > +	 */
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	if (unlikely(vmf->pgoff >= size)) {
> > +		error = -EIO;
> > +		goto out;
> > +	}
> 
> How does this play with holepunching?  Checking i_size won't work there?

It doesn't.  But the same problem exists with non-DAX files too, and
when I pointed it out, it was met with a shrug from the crowd.  I saw a
patch series just recently that fixes it for XFS, but as far as I know,
btrfs and ext4 still don't play well with pagefault vs hole-punch races.

> > +	memset(&bh, 0, sizeof(bh));
> > +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> > +	bh.b_size = PAGE_SIZE;
> 
> ah, there.
> 
> PAGE_SIZE varies a lot between architectures.  What are the
> implications of this?

At the moment, you can only do DAX for blocksizes that are equal to
PAGE_SIZE.  That's a restriction that existed for the previous XIP code,
and I haven't fixed it all for DAX yet.  I'd like to, but it's not high on
my list of things to fix.  Since these are in-memory filesystems, there's
not likely to be high demand to move the filesystem between machines.
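
To make the PAGE_SIZE dependence concrete, here is a small standalone sketch of the block calculation from the quoted hunk. The helper names page_to_block() and blocks_per_page() are made up for illustration, and the code assumes blkbits <= page_shift. When the block size equals the page size the shift is zero and page N maps to block N; on a machine with larger pages, one page would span several blocks.

```c
#include <stdint.h>

/* First filesystem block backing page 'pgoff', mirroring
 * block = pgoff << (PAGE_SHIFT - blkbits) from the patch.
 * Assumes blkbits <= page_shift. */
static uint64_t page_to_block(uint64_t pgoff, unsigned page_shift,
			      unsigned blkbits)
{
	return pgoff << (page_shift - blkbits);
}

/* Number of filesystem blocks covered by one page.
 * Assumes blkbits <= page_shift. */
static unsigned blocks_per_page(unsigned page_shift, unsigned blkbits)
{
	return 1u << (page_shift - blkbits);
}
```

For example, with 4KiB blocks (blkbits = 12) and 4KiB pages (page_shift = 12), page 3 starts at block 3. With the same blocks and 64KiB pages (page_shift = 16), page 3 starts at block 48 and covers 16 blocks, which is the case the current code does not handle.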

> > + repeat:
> > +	page = find_get_page(mapping, vmf->pgoff);
> > +	if (page) {
> > +		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
> > +			page_cache_release(page);
> > +			return VM_FAULT_RETRY;
> > +		}
> > +		if (unlikely(page->mapping != mapping)) {
> > +			unlock_page(page);
> > +			page_cache_release(page);
> > +			goto repeat;
> > +		}
> > +		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +		if (unlikely(vmf->pgoff >= size)) {
> > +			error = -EIO;
> 
> What happens when this happens?

This case is where we have a struct page covering a hole in the file from
a read fault and we've raced with a truncate.  It's basically the same code
that's in filemap_fault().
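
The i_size check in the quoted hunk rounds the file size up to whole pages and rejects any fault at or beyond that page count. A minimal userspace sketch of the arithmetic follows, assuming 4KiB pages. The SKETCH_ macros, size_in_pages() and fault_beyond_eof() are illustrative names, not kernel symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page geometry for the sketch: 4KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* File size rounded up to a whole number of pages, as in
 * (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT. */
static uint64_t size_in_pages(uint64_t i_size)
{
	return (i_size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* True when a fault at page offset 'pgoff' lies past end of file,
 * the condition under which the handler returns -EIO. */
static bool fault_beyond_eof(uint64_t pgoff, uint64_t i_size)
{
	return pgoff >= size_in_pages(i_size);
}
```

With these definitions a 4096-byte file has one page, so a fault at pgoff 1 is past EOF, while a 4097-byte file has two pages and the same fault is in range. If a truncate shrinks i_size between the page lookup and this check, the recomputed page count is what catches the race.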

> > +			goto unlock_page;
> > +		}
> > +	}
> > +
> > +	error = get_block(inode, block, &bh, 0);
> > +	if (!error && (bh.b_size < PAGE_SIZE))
> > +		error = -EIO;
> 
> How could this happen?

The only way I can think of is if the filesystem was corrupted.  But it's
worth programming defensively, no?
