public inbox for linux-xfs@vger.kernel.org
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: ross.zwisler@linux.intel.com, jack@suse.cz, xfs@oss.sgi.com
Subject: Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
Date: Thu, 29 Oct 2015 10:29:50 -0400
Message-ID: <20151029142950.GE11663@bfoster.bfoster>
In-Reply-To: <1445225238-30413-4-git-send-email-david@fromorbit.com>

On Mon, Oct 19, 2015 at 02:27:15PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> DAX has a page fault serialisation problem with block allocation.
> Because it allows concurrent page faults and does not have a page
> lock to serialise faults to the same page, it can get two concurrent
> faults to the page that race.
> 
> When two read faults race, this isn't a huge problem as the data
> underlying the page is not changing and so "detect and drop" works
> just fine. The issues are to do with write faults.
> 
> When two write faults occur, we serialise block allocation in
> get_blocks() so only one fault will allocate the extent. It will,
> however, be marked as an unwritten extent, and that is where the
> problem lies - the DAX fault code cannot differentiate between a
> block that was just allocated and a block that was preallocated and
> needs zeroing. The result is that both write faults end up zeroing
> the block and attempting to convert it back to written.
> 
> The problem is that the first fault can zero and convert before the
> second fault starts zeroing, resulting in the zeroing for the second
> fault overwriting the data that the first fault wrote with zeros.
> The second fault then attempts to convert the unwritten extent,
> which is then a no-op because it's already written. Data loss occurs
> as a result of this race.
> 
> Because there is no sane locking construct in the page fault code
> that we can use for serialisation across the page faults, we need to
> ensure block allocation and zeroing occurs atomically in the
> filesystem. This means we can still take concurrent page faults and
> the only time they will serialise is in the filesystem
> mapping/allocation callback. The page fault code will always see
> written, initialised extents, so we will be able to remove the
> unwritten extent handling from the DAX code when all filesystems are
> converted.
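
For anyone following along, the race described above and the proposed fix
can be replayed in a small userspace model. This is only a sketch under
stated assumptions: struct block, fault_lookup_needs_zero(),
fault_zero_and_convert() and alloc_zeroed() are hypothetical names invented
for illustration, not kernel or DAX APIs, and the interleaving is fixed by
hand rather than produced by real concurrent faults.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical model of one filesystem block and its extent state. */
enum extent_state { EXT_UNWRITTEN, EXT_WRITTEN };

struct block {
	enum extent_state state;
	char data[BLOCK_SIZE];
};

/* Fault step 1: look up the mapping and decide whether zeroing is needed. */
static bool fault_lookup_needs_zero(const struct block *b)
{
	assert(b != NULL);
	return b->state == EXT_UNWRITTEN;
}

/* Fault step 2: zero if step 1 said so, then convert to written. */
static void fault_zero_and_convert(struct block *b, bool needs_zero)
{
	if (needs_zero)
		memset(b->data, 0, BLOCK_SIZE);
	b->state = EXT_WRITTEN;		/* no-op if already written */
}

/* Modelled fix: allocation zeroes once and hands back a written block. */
static void alloc_zeroed(struct block *b)
{
	if (b->state == EXT_WRITTEN)
		return;
	memset(b->data, 0, BLOCK_SIZE);
	b->state = EXT_WRITTEN;
}

/* Replays the racy interleaving; returns true if fault A's data survives. */
static bool replay_unwritten_race(void)
{
	struct block b = { .state = EXT_UNWRITTEN };
	bool a_zero, b_zero;

	memset(b.data, 0xff, BLOCK_SIZE);	/* stale contents */
	a_zero = fault_lookup_needs_zero(&b);	/* fault A sees unwritten */
	b_zero = fault_lookup_needs_zero(&b);	/* fault B sees unwritten */
	fault_zero_and_convert(&b, a_zero);	/* A zeroes and converts */
	b.data[0] = 'A';			/* store via A's mapping */
	fault_zero_and_convert(&b, b_zero);	/* B zeroes over A's data */
	return b.data[0] == 'A';
}

/* Same interleaving, but zeroing happens inside the allocation step. */
static bool replay_zeroed_allocation(void)
{
	struct block b = { .state = EXT_UNWRITTEN };
	bool a_zero, b_zero;

	memset(b.data, 0xff, BLOCK_SIZE);	/* stale contents */
	alloc_zeroed(&b);			/* fault A allocates and zeroes */
	a_zero = fault_lookup_needs_zero(&b);
	alloc_zeroed(&b);			/* fault B finds it written */
	b_zero = fault_lookup_needs_zero(&b);
	fault_zero_and_convert(&b, a_zero);
	b.data[0] = 'A';
	fault_zero_and_convert(&b, b_zero);	/* nothing left to zero */
	return b.data[0] == 'A';
}
```

In this model replay_unwritten_race() reports that the store is lost, while
replay_zeroed_allocation() reports that it survives, because the fault path
never sees an unwritten block and so never zeroes.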
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dax.c           |  5 +++++
>  fs/xfs/xfs_aops.c  | 13 +++++++++----
>  fs/xfs/xfs_iomap.c | 21 ++++++++++++++++++++-
>  3 files changed, 34 insertions(+), 5 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index c3cb5a5..f4f5b43 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
...
>  	tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
> +
> +	/*
> +	 * For DAX, we do not allocate unwritten extents, but instead we zero
> +	 * the block before we commit the transaction.  Ideally we'd like to do
> +	 * this outside the transaction context, but if we commit and then crash
> +	 * we may not have zeroed the blocks and this will be exposed on
> +	 * recovery of the allocation. Hence we must zero before commit.
> +	 * Further, if we are mapping unwritten extents here, we need to zero
> +	 * and convert them to written so that we don't need an unwritten extent
> +	 * callback for DAX. This also means that we need to be able to dip into
> +	 * the reserve block pool if there is no space left but we need to do
> +	 * unwritten extent conversion.
> +	 */
> +	if (IS_DAX(VFS_I(ip))) {
> +		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> +		tp->t_flags |= XFS_TRANS_RESERVE;
> +	}

Am I reading the commit log correctly that block zeroing is only required
for DAX faults? Do we also zero blocks for DAX DIO just to be consistent,
or is it required there as well? I ask because it looks like we still have
end_io completion for DIO writes anyway.
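
As an aside, the "zero before commit" ordering argument in the code comment
above can be illustrated with a toy crash model. This is a sketch only:
struct disk, enum order and stale_exposed_after_crash() are hypothetical
names, and "mapped" simply stands in for an allocation whose transaction
has been committed and would be replayed by log recovery.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLK 8

/* Hypothetical persistent state: one block plus its committed mapping. */
struct disk {
	bool mapped;			/* allocation committed to the log */
	unsigned char data[BLK];
};

enum order { COMMIT_THEN_ZERO, ZERO_THEN_COMMIT };

/*
 * Perform only the first of the two steps (commit, zero) in the given
 * order, as if a crash hit before the second step. Return true if a
 * read of the file after recovery would expose stale data.
 */
static bool stale_exposed_after_crash(enum order o)
{
	struct disk d = { .mapped = false };
	int i;

	assert(o == COMMIT_THEN_ZERO || o == ZERO_THEN_COMMIT);
	memset(d.data, 0xAA, BLK);	/* previous owner's contents */

	if (o == COMMIT_THEN_ZERO)
		d.mapped = true;	/* crash before the zeroing */
	else
		memset(d.data, 0, BLK);	/* crash before the commit */

	/* After recovery, only committed mappings are reachable. */
	if (!d.mapped)
		return false;
	for (i = 0; i < BLK; i++)
		if (d.data[i] != 0)
			return true;
	return false;
}
```

With COMMIT_THEN_ZERO the recovered mapping points at unzeroed contents;
with ZERO_THEN_COMMIT a crash leaves the block unmapped, so nothing stale
is reachable through the file.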

Brian

>  	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write,
>  				  resblks, resrtextents);
>  	/*
> @@ -221,7 +239,7 @@ xfs_iomap_write_direct(
>  	xfs_bmap_init(&free_list, &firstfsb);
>  	nimaps = 1;
>  	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
> -				XFS_BMAPI_PREALLOC, &firstfsb, resblks, imap,
> +				bmapi_flags, &firstfsb, resblks, imap,
>  				&nimaps, &free_list);
>  	if (error)
>  		goto out_bmap_cancel;
> @@ -232,6 +250,7 @@ xfs_iomap_write_direct(
>  	error = xfs_bmap_finish(&tp, &free_list, &committed);
>  	if (error)
>  		goto out_bmap_cancel;
> +
>  	error = xfs_trans_commit(tp);
>  	if (error)
>  		goto out_unlock;
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs



Thread overview: 35+ messages
2015-10-19  3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
2015-10-19  3:27 ` [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
2015-10-29 14:27   ` Brian Foster
2015-10-19  3:27 ` [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
2015-10-29 14:27   ` Brian Foster
2015-10-29 23:35     ` Dave Chinner
2015-10-30 12:36       ` Brian Foster
2015-11-02  1:21         ` Dave Chinner
2015-10-19  3:27 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner
2015-10-29 14:29   ` Brian Foster [this message]
2015-10-29 23:37     ` Dave Chinner
2015-10-30 12:36       ` Brian Foster
2015-11-02  1:14         ` Dave Chinner
2015-11-02 14:15           ` Brian Foster
2015-11-02 21:44             ` Dave Chinner
2015-11-03  3:53               ` Dan Williams
2015-11-03  5:04                 ` Dave Chinner
2015-11-04  0:50                   ` Ross Zwisler
2015-11-04  1:02                     ` Dan Williams
2015-11-04  4:46                       ` Ross Zwisler
2015-11-04  9:06                         ` Jan Kara
2015-11-04 15:35                           ` Ross Zwisler
2015-11-04 17:21                             ` Jan Kara
2015-11-03  9:16               ` Jan Kara
2015-10-19  3:27 ` [PATCH 4/6] xfs: DAX does not use IO completion callbacks Dave Chinner
2015-10-29 14:29   ` Brian Foster
2015-10-29 23:39     ` Dave Chinner
2015-10-30 12:37       ` Brian Foster
2015-10-19  3:27 ` [PATCH 5/6] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
2015-10-29 14:30   ` Brian Foster
2015-10-19  3:27 ` [PATCH 6/6] xfs: xfs_filemap_pmd_fault treats read faults as write faults Dave Chinner
2015-10-29 14:30   ` Brian Foster
2015-11-05 23:48 ` [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Ross Zwisler
2015-11-06 22:32   ` Dave Chinner
2015-11-06 18:12 ` Boylston, Brian
