From: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
To: Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org>,
	Matthew Wilcox <mawilcox-0li6OtcxBFHby3iVrkZq2A@public.gmane.org>,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	Andreas Dilger
	<adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org>,
	Alexander Viro
	<viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>,
	Jan Kara <jack-IBi9RG/b67k@public.gmane.org>,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Subject: Re: [PATCH v2 2/9] ext2: tell DAX the size of allocation holes
Date: Fri, 26 Aug 2016 15:29:34 -0600
Message-ID: <20160826212934.GA11265@linux.intel.com>
In-Reply-To: <20160825075728.GA11235-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>

On Thu, Aug 25, 2016 at 12:57:28AM -0700, Christoph Hellwig wrote:
> Hi Ross,
> 
> can you take a look at my (fully working, but not fully cleaned up) version
> of the iomap based DAX code here:
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/iomap-dax
> 
> By using iomap we don't even have the size hole problem and totally
> get out of the reverse-engineer what buffer_heads are trying to tell
> us business.  It also gets rid of the other warts of the DAX path
> due to pretending to be like direct I/O, so this might be a better
> way forward also for ext2/4.

In general I agree that the usage of struct iomap seems more straightforward
than the old way of using struct buffer_head + get_block_t.  I really don't
think we want to have two competing DAX I/O and fault paths, though, which I
assume everyone else agrees with as well.

These changes don't remove the things in XFS needed by the old I/O and fault
paths (e.g. xfs_get_blocks_direct() is still there and unchanged).  Is the
correct way forward to get buy-in from ext2/ext4 so that they also move to
supporting an iomap based I/O path (xfs_file_iomap_begin(),
xfs_iomap_write_direct(), etc.)?  That would allow us to have parallel I/O and
fault paths for a while, then remove the old buffer_head based versions when
the three supported filesystems have moved to iomap.

If ext2 and ext4 don't choose to move to iomap, though, I don't think we want
to have a separate I/O & fault path for iomap/XFS.  That seems too painful,
and the old buffer_head version should continue to work, ugly as it may be.

Assuming we can get buy-in from ext4/ext2, I can work on a PMD version of the
iomap based fault path that is equivalent to the buffer_head based one I sent
out in my series, and we can all eventually move to that.

A few comments/questions on the implementation:

1) In your mail above you say "It also gets rid of the other warts of the DAX
   path due to pretending to be like direct I/O".  I assume by this you mean
   the code in dax_do_io() around DIO_LOCKING, inode_dio_begin(), etc?
   Perhaps there are other things as well in XFS, but this is what I see in
   the DAX code.  If so, yep, this seems like a win.  I don't understand how
   DIO_LOCKING is relevant to the DAX I/O path, as we never mix buffered and
   direct access.

   The comment in dax_do_io() for the inode_dio_begin() call says that it
   prevents the I/O from racing with truncate.  Am I correct that we now get
   this protection via the xfs_rw_ilock()/xfs_rw_iunlock() calls in
   xfs_file_dax_write()?

2) Just a nit, I noticed that you used "~(PAGE_SIZE - 1)" in several places in
   iomap_dax_actor() and iomap_dax_fault() instead of PAGE_MASK.  Was this
   intentional?

3) It's kind of weird having iomap_dax_fault() in fs/dax.c but having
   iomap_dax_actor() and iomap_dax_rw() in fs/iomap.c.  I'm guessing the
   latter is placed where it is because it uses iomap_apply(), which is local
   to fs/iomap.c?  Anyway, it would be nice if we could keep them together, if
   possible.

4) In iomap_dax_actor() you do this check:

	WARN_ON_ONCE(iomap->type != IOMAP_MAPPED);

   If we hit this we should bail with -EIO, yea?  Otherwise we could write to
   unmapped space or something horrible.

5) In iomap_dax_fault(), I think the "I/O beyond the end of the file" check
   might have been broken.  Take for example an I/O to the second page of a
   file, where the file has size one page.  So:

   vmf->pgoff = 1
   i_size_read(inode) = 4096

   Here's the old code in dax_fault():

	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
	if (vmf->pgoff >= size)
		return VM_FAULT_SIGBUS;

   size = (4096 + 4096 - 1) >> PAGE_SHIFT = 1
   vmf->pgoff is 1 and size is 1, so we return SIGBUS

   Here's the new code:

	if (pos >= i_size_read(inode) + PAGE_SIZE - 1)
		return VM_FAULT_SIGBUS;

   pos = vmf->pgoff << PAGE_SHIFT = 4096
   i_size_read(inode) + PAGE_SIZE - 1 = 8191
   so 'pos' isn't >= where we calculate the end of the file to be, and we do I/O

   Basically the old check did the "+ PAGE_SIZE - 1" so that the >> PAGE_SHIFT
   was sure to round up to the next full page.  You don't need this with your
   current logic, so I think the test should just be:

	if (pos >= i_size_read(inode))
		return VM_FAULT_SIGBUS;

   Right?

6) Regarding the "we don't even have the size hole problem" comment in your
   mail, the current PMD logic requires us to know the size of the hole.  This
   is important so that we can fault in a huge zero page if we have a 2 MiB
   hole.  It's fine if that 2 MiB page then gets fragmented into 4k DAX
   allocations when we start to do writes, but the path the other way doesn't
   work.  If we don't know the size of holes then we can't fault in a 2 MiB
   zero page, so we'll use 4k zero pages to satisfy reads.  This means that if
   later we want to fault in a 2 MiB DAX allocation, we don't have a single
   entry that we can use to lock the entire 2 MiB range while we clean the
   radix tree and unmap the range from all the user processes.  With the
   current PMD logic this will mean that if someone does a 4k read that faults
   in a 4k zero page, we will only use 4k faults for that range and won't use
   PMDs.

   The current XFS code in the v4.8 tree tells me the size of the hole, and I
   think we need to keep this functionality.
