From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: ross.zwisler@linux.intel.com, jack@suse.cz, xfs@oss.sgi.com
Subject: Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
Date: Mon, 2 Nov 2015 09:15:10 -0500
Message-ID: <20151102141509.GA29346@bfoster.bfoster>
In-Reply-To: <20151102011433.GW19199@dastard>

On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote:
> On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
> > On Fri, Oct 30, 2015 at 10:37:56AM +1100, Dave Chinner wrote:
> > > On Thu, Oct 29, 2015 at 10:29:50AM -0400, Brian Foster wrote:
> > > > On Mon, Oct 19, 2015 at 02:27:15PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > >
> > ...
> > > > > +	/*
> > > > > +	 * For DAX, we do not allocate unwritten extents, but instead we zero
> > > > > +	 * the block before we commit the transaction. Ideally we'd like to do
> > > > > +	 * this outside the transaction context, but if we commit and then crash
> > > > > +	 * we may not have zeroed the blocks and this will be exposed on
> > > > > +	 * recovery of the allocation. Hence we must zero before commit.
> > > > > +	 * Further, if we are mapping unwritten extents here, we need to zero
> > > > > +	 * and convert them to written so that we don't need an unwritten extent
> > > > > +	 * callback for DAX. This also means that we need to be able to dip into
> > > > > +	 * the reserve block pool if there is no space left but we need to do
> > > > > +	 * unwritten extent conversion.
> > > > > +	 */
> > > > > +	if (IS_DAX(VFS_I(ip))) {
> > > > > +		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> > > > > +		tp->t_flags |= XFS_TRANS_RESERVE;
> > > > > +	}
> > > >
> > > > Am I following the commit log description correctly in that block
> > > > zeroing is only required for DAX faults? Do we zero blocks for DAX DIO
> > > > as well to be consistent, or is that also required (because it looks
> > > > like we still have end_io completion for dio writes anyways)?
> > >
> > > DAX DIO will do the zeroing rather than using unwritten extents,
> > > too. But we still have DIO IO completion as that needs to do file
> > > size updates.
> > >
> >
> > Right, my question is: is the DAX DIO zeroing required to avoid the
> > races described as the purpose for this patch, or is this just here as a
> > simplification? In other words, why not do block zeroing only for DAX
> > faults and not DAX/DIO?
>
> Because the only reason the DIO code does 'allocate unwritten;
> convert unwritten on IO completion' is so that if we have:
>
> allocate
> trans_commit
> .... log force
> journal IO submit
> .... journal IO completion
> submit data io
> crash
>
> We don't expose allocated blocks containing stale data to userspace
> via recovery. The allocation uses unwritten extents to ensure that
> if the allocation is recovered without the corresponding completion,
> it reads as zeros rather than whatever was previously on disk in that
> location.
>
> For DAX, we can zero the blocks inside the allocation transaction
> for direct IO, and hence even if we have the above happen, we'll
> only ever expose zeros. Hence we don't need unwritten extents in the
> DIO path to avoid stale data exposure, and so we can simply avoid
> all that extra overhead of unwritten extent conversion on
> completion...
>

Yeah, I get that bit. In fact, I was hoping to get to doing something
similar for delalloc extents to deal with the similar issue there if the
extent conversion makes it to the log and we crash before I/O (as I
believe we've discussed in the past). That is for something unrelated,
however...
> > I ask because my understanding is the purpose of this patch is a special
> > atomic zeroed allocation requirement just for mmap.
>
> The requirement is set by DAX+mmap; the implementation is a generic
> "allocate zeroed blocks" mechanism that can be applied to any
> allocation that uses unwritten extents to allocate zeroed blocks if
> zeroing is more efficient than using unwritten extents....
>

Ok, I take that to mean that there is no race or corruption vector so
long as DAX/mmap uses the atomic zero allocation. The implementation
mostly looks pretty good to me, but I suspect I'm not being clear with
my question, which is...

> > Unless there is some
> > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
> > causes a clear_pmem() over every page sized chunk of the target I/O
> > range for which we already have the data.
>
> I don't follow - this only zeros blocks when we do allocation of new
> blocks or overwrite unwritten extents, not on blocks which we
> already have written data extents allocated for...
>

Why are we assuming that block zeroing is more efficient than unwritten
extents for DAX/dio? I haven't played with pmem enough to know for sure
one way or another (or if hw support is imminent), but I'd expect the
latter to be more efficient in general without any kind of hardware
support.

Just as an example, here's an 8GB pwrite test, large buffer size, to XFS
on a ramdisk mounted with '-o dax':

- Before this series:

# xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
wrote 8589934592/8589934592 bytes at offset 0
8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec)

- After this series:

# xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
wrote 8589934592/8589934592 bytes at offset 0
8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec)
The impact is less with a smaller buffer size so the above is just meant
to illustrate the point. FWIW, I'm also fine with getting this in as a
matter of "correctness before performance" since this stuff is clearly
still under development, but as far as I can see so far we should
probably ultimately prefer unwritten extents for DAX/DIO (or at least
plan to run some similar tests on real pmem hw). Thoughts?

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs