From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH 3/7] xfs: Don't use unwritten extents for DAX
Date: Mon, 2 Nov 2015 09:15:30 -0500 [thread overview]
Message-ID: <20151102141530.GC29346@bfoster.bfoster> (raw)
In-Reply-To: <1446435735-1526-4-git-send-email-david@fromorbit.com>
On Mon, Nov 02, 2015 at 02:42:11PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> DAX has a page fault serialisation problem with block allocation.
> Because it allows concurrent page faults and does not have a page
> lock to serialise faults to the same page, it can get two concurrent
> faults to the page that race.
>
> When two read faults race, this isn't a huge problem as the data
> underlying the page is not changing and so "detect and drop" works
> just fine. The issues are to do with write faults.
>
> When two write faults occur, we serialise block allocation in
> get_blocks() so only one faul will allocate the extent. It will,
> however, be marked as an unwritten extent, and that is where the
> problem lies - the DAX fault code cannot differentiate between a
> block that was just allocated and a block that was preallocated and
> needs zeroing. The result is that both write faults end up zeroing
> the block and attempting to convert it back to written.
>
> The problem is that the first fault can zero and convert before the
> second fault starts zeroing, resulting in the zeroing for the second
> fault overwriting the data that the first fault wrote with zeros.
> The second fault then attempts to convert the unwritten extent,
> which is then a no-op because it's already written. Data loss occurs
> as a result of this race.
>
> Because there is no sane locking construct in the page fault code
> that we can use for serialisation across the page faults, we need to
> ensure block allocation and zeroing occurs atomically in the
> filesystem. This means we can still take concurrent page faults and
> the only time they will serialise is in the filesystem
> mapping/allocation callback. The page fault code will always see
> written, initialised extents, so we will be able to remove the
> unwritten extent handling from the DAX code when all filesystems are
> converted.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
I'm still not convinced that block zeroing is the right thing to do for
DAX/dio (re: the discussion on the previous version), but the code looks
fine to me and we should be able to easily optimize this in a separate
patch if warranted:
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/dax.c | 5 +++++
> fs/xfs/xfs_aops.c | 13 +++++++++----
> fs/xfs/xfs_iomap.c | 21 ++++++++++++++++++++-
> 3 files changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index a86d3cc..131fd35a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -29,6 +29,11 @@
> #include <linux/uio.h>
> #include <linux/vmstat.h>
>
> +/*
> + * dax_clear_blocks() is called from within transaction context from XFS,
> + * and hence this means the stack from this point must follow GFP_NOFS
> + * semantics for all operations.
> + */
> int dax_clear_blocks(struct inode *inode, sector_t block, long size)
> {
> struct block_device *bdev = inode->i_sb->s_bdev;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 366e41eb..7b4f849 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1293,15 +1293,12 @@ xfs_map_direct(
>
> trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
>
> - /* XXX: preparation for removing unwritten extents in DAX */
> -#if 0
> if (dax_fault) {
> ASSERT(type == XFS_IO_OVERWRITE);
> trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
> imap);
> return;
> }
> -#endif
>
> if (bh_result->b_private) {
> ioend = bh_result->b_private;
> @@ -1429,10 +1426,12 @@ __xfs_get_blocks(
> if (error)
> goto out_unlock;
>
> + /* for DAX, we convert unwritten extents directly */
> if (create &&
> (!nimaps ||
> (imap.br_startblock == HOLESTARTBLOCK ||
> - imap.br_startblock == DELAYSTARTBLOCK))) {
> + imap.br_startblock == DELAYSTARTBLOCK) ||
> + (IS_DAX(inode) && ISUNWRITTEN(&imap)))) {
> if (direct || xfs_get_extsz_hint(ip)) {
> /*
> * xfs_iomap_write_direct() expects the shared lock. It
> @@ -1477,6 +1476,12 @@ __xfs_get_blocks(
> goto out_unlock;
> }
>
> + if (IS_DAX(inode) && create) {
> + ASSERT(!ISUNWRITTEN(&imap));
> + /* zeroing is not needed at a higher layer */
> + new = 0;
> + }
> +
> /* trim mapping down to size requested */
> if (direct || size > (1 << inode->i_blkbits))
> xfs_map_trim_size(inode, iblock, bh_result,
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index c3cb5a5..f4f5b43 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -132,6 +132,7 @@ xfs_iomap_write_direct(
> int committed;
> int error;
> int lockmode;
> + int bmapi_flags = XFS_BMAPI_PREALLOC;
>
> rt = XFS_IS_REALTIME_INODE(ip);
> extsz = xfs_get_extsz_hint(ip);
> @@ -195,6 +196,23 @@ xfs_iomap_write_direct(
> * Allocate and setup the transaction
> */
> tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
> +
> + /*
> + * For DAX, we do not allocate unwritten extents, but instead we zero
> + * the block before we commit the transaction. Ideally we'd like to do
> + * this outside the transaction context, but if we commit and then crash
> + * we may not have zeroed the blocks and this will be exposed on
> + * recovery of the allocation. Hence we must zero before commit.
> + * Further, if we are mapping unwritten extents here, we need to zero
> + * and convert them to written so that we don't need an unwritten extent
> + * callback for DAX. This also means that we need to be able to dip into
> + * the reserve block pool if there is no space left but we need to do
> + * unwritten extent conversion.
> + */
> + if (IS_DAX(VFS_I(ip))) {
> + bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> + tp->t_flags |= XFS_TRANS_RESERVE;
> + }
> error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write,
> resblks, resrtextents);
> /*
> @@ -221,7 +239,7 @@ xfs_iomap_write_direct(
> xfs_bmap_init(&free_list, &firstfsb);
> nimaps = 1;
> error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
> - XFS_BMAPI_PREALLOC, &firstfsb, resblks, imap,
> + bmapi_flags, &firstfsb, resblks, imap,
> &nimaps, &free_list);
> if (error)
> goto out_bmap_cancel;
> @@ -232,6 +250,7 @@ xfs_iomap_write_direct(
> error = xfs_bmap_finish(&tp, &free_list, &committed);
> if (error)
> goto out_bmap_cancel;
> +
> error = xfs_trans_commit(tp);
> if (error)
> goto out_unlock;
> --
> 2.5.0
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2015-11-02 14:15 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-11-02 3:42 [PATCH 0/7] xfs: patches remaining for 4.4 merge window Dave Chinner
2015-11-02 3:42 ` [PATCH 1/7] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
2015-11-02 3:42 ` [PATCH 2/7] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
2015-11-02 14:15 ` Brian Foster
2015-11-02 3:42 ` [PATCH 3/7] xfs: Don't use unwritten extents for DAX Dave Chinner
2015-11-02 14:15 ` Brian Foster [this message]
2015-11-02 3:42 ` [PATCH 4/7] xfs: DAX does not use IO completion callbacks Dave Chinner
2015-11-02 3:42 ` [PATCH 5/7] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
2015-11-02 3:42 ` [PATCH 6/7] xfs: xfs_filemap_pmd_fault treats read faults as write faults Dave Chinner
2015-11-02 3:42 ` [PATCH 7/7] xfs: optimise away log forces on timestamp updates for fdatasync Dave Chinner
2015-11-02 14:15 ` Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151102141530.GC29346@bfoster.bfoster \
--to=bfoster@redhat.com \
--cc=david@fromorbit.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox