From: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
To: Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org>,
Matthew Wilcox <mawilcox-0li6OtcxBFHby3iVrkZq2A@public.gmane.org>,
linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
Andreas Dilger
<adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org>,
Alexander Viro
<viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>,
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andrew Morton
<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Subject: Re: [PATCH v2 2/9] ext2: tell DAX the size of allocation holes
Date: Fri, 26 Aug 2016 15:29:34 -0600
Message-ID: <20160826212934.GA11265@linux.intel.com>
In-Reply-To: <20160825075728.GA11235-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
On Thu, Aug 25, 2016 at 12:57:28AM -0700, Christoph Hellwig wrote:
> Hi Ross,
>
> can you take at my (fully working, but not fully cleaned up) version
> of the iomap based DAX code here:
>
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/iomap-dax
>
> By using iomap we don't even have the size hole problem and totally
> get out of the reverse-engineer what buffer_heads are trying to tell
> us business. It also gets rid of the other warts of the DAX path
> due to pretending to be like direct I/O, so this might be a better
> way forward also for ext2/4.
In general I agree that the usage of struct iomap seems more straightforward
than the old way of using struct buffer_head + get_block_t. I really don't
think we want to have two competing DAX I/O and fault paths, though, which I
assume everyone else agrees with as well.
These changes don't remove the things in XFS needed by the old I/O and fault
paths (e.g. xfs_get_blocks_direct() is still there and unchanged). Is the
correct way forward to get buy-in from ext2/ext4 so that they also move to
supporting an iomap based I/O path (xfs_file_iomap_begin(),
xfs_iomap_write_direct(), etc.)? That would allow us to have parallel I/O and
fault paths for a while, then remove the old buffer_head based versions when
the three supported filesystems have moved to iomap.
If ext2 and ext4 don't choose to move to iomap, though, I don't think we want
to have a separate I/O & fault path for iomap/XFS. That seems too painful,
and the old buffer_head version should continue to work, ugly as it may be.
Assuming we can get buy-in from ext4/ext2, I can work on a PMD version of the
iomap based fault path that is equivalent to the buffer_head based one I sent
out in my series, and we can all eventually move to that.
A few comments/questions on the implementation:
1) In your mail above you say "It also gets rid of the other warts of the DAX
path due to pretending to be like direct I/O". I assume by this you mean
the code in dax_do_io() around DIO_LOCKING, inode_dio_begin(), etc?
Perhaps there are other things as well in XFS, but this is what I see in
the DAX code. If so, yep, this seems like a win. I don't understand how
DIO_LOCKING is relevant to the DAX I/O path, as we never mix buffered and
direct access.
The comment in dax_do_io() for the inode_dio_begin() call says that it
prevents the I/O from racing with truncate. Am I correct that we now get
this protection via the xfs_rw_ilock()/xfs_rw_iunlock() calls in
xfs_file_dax_write()?
2) Just a nit, I noticed that you used "~(PAGE_SIZE - 1)" in several places in
iomap_dax_actor() and iomap_dax_fault() instead of PAGE_MASK. Was this
intentional?
3) It's kind of weird having iomap_dax_fault() in fs/dax.c but having
iomap_dax_actor() and iomap_dax_rw() in fs/iomap.c. I'm guessing the
latter two are placed where they are because they use iomap_apply(), which
is local to fs/iomap.c? Anyway, it would be nice if we could keep them
together, if possible.
4) In iomap_dax_actor() you do this check:

	WARN_ON_ONCE(iomap->type != IOMAP_MAPPED);

If we hit this we should bail with -EIO, yea? Otherwise we could write to
unmapped space or something horrible.
5) In iomap_dax_fault(), I think the "I/O beyond the end of the file" check
might have been broken. Take for example an I/O to the second page of a
file, where the file has size one page. So:

	vmf->pgoff = 1
	i_size_read(inode) = 4096

Here's the old code in dax_fault():

	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
	if (vmf->pgoff >= size)
		return VM_FAULT_SIGBUS;

	size = (4096 + 4096 - 1) >> PAGE_SHIFT = 1

vmf->pgoff is 1 and size is 1, so we return SIGBUS.

Here's the new code:

	if (pos >= i_size_read(inode) + PAGE_SIZE - 1)
		return VM_FAULT_SIGBUS;

	pos = vmf->pgoff << PAGE_SHIFT = 4096
	i_size_read(inode) + PAGE_SIZE - 1 = 8191

So 'pos' isn't >= where we calculate the end of the file to be, and we do
the I/O.

Basically the old check did the "+ PAGE_SIZE - 1" so that the >> PAGE_SHIFT
was sure to round up to the next full page. You don't need this with your
current logic, so I think the test should just be:

	if (pos >= i_size_read(inode))
		return VM_FAULT_SIGBUS;

Right?
6) Regarding the "we don't even have the size hole problem" comment in your
mail, the current PMD logic requires us to know the size of the hole. This
is important so that we can fault in a huge zero page if we have a 2 MiB
hole. It's fine if that 2 MiB page then gets fragmented into 4k DAX
allocations when we start to do writes, but the path the other way doesn't
work. If we don't know the size of holes then we can't fault in a 2 MiB
zero page, so we'll use 4k zero pages to satisfy reads. This means that if
later we want to fault in a 2 MiB DAX allocation, we don't have a single
entry that we can use to lock the entire 2 MiB range while we clean the
radix tree and unmap the range from all the user processes. With the
current PMD logic this will mean that if someone does a 4k read that faults
in a 4k zero page, we will only use 4k faults for that range and won't use
PMDs.
The current XFS code in the v4.8 tree tells me the size of the hole, and I
think we need to keep this functionality.
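
To make the arithmetic in comment 5 easy to re-check, the three variants of
the end-of-file test can be compared in userspace. This is only a sketch, not
code from either tree: PAGE_SHIFT is assumed to be 12 (4 KiB pages), and the
function names (old_check_sigbus() and friends) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel constants; 4 KiB pages are assumed. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Old dax_fault() test: round i_size up to whole pages, compare page indices. */
static bool old_check_sigbus(uint64_t pgoff, uint64_t i_size)
{
	uint64_t size = (i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;

	return pgoff >= size;
}

/* Test as quoted from the iomap branch: byte offsets, but the round-up term is kept. */
static bool new_check_sigbus(uint64_t pgoff, uint64_t i_size)
{
	uint64_t pos = pgoff << PAGE_SHIFT;

	return pos >= i_size + PAGE_SIZE - 1;
}

/* Suggested test: byte offsets, no round-up term. */
static bool proposed_check_sigbus(uint64_t pgoff, uint64_t i_size)
{
	uint64_t pos = pgoff << PAGE_SHIFT;

	return pos >= i_size;
}

/*
 * Hypothetical self-check: for every file size up to four pages and every
 * page offset up to eight, the suggested test must agree with the old one.
 */
static void sigbus_checks_selftest(void)
{
	for (uint64_t i_size = 0; i_size <= 4 * PAGE_SIZE; i_size++)
		for (uint64_t pgoff = 0; pgoff < 8; pgoff++)
			assert(old_check_sigbus(pgoff, i_size) ==
			       proposed_check_sigbus(pgoff, i_size));
}
```

With pgoff = 1 and i_size = 4096, old_check_sigbus() and
proposed_check_sigbus() both report SIGBUS, while new_check_sigbus() does not,
which is the case described in comment 5.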