linux-ext4.vger.kernel.org archive mirror
From: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
To: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
	"linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org"
	<linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org>,
	XFS Developers <xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org>,
	linux-fsdevel
	<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	linux-ext4 <linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: Subtle races between DAX mmap fault and write path
Date: Sat, 30 Jul 2016 10:12:49 +1000
Message-ID: <20160730001249.GE16044@dastard>
In-Reply-To: <CAPcyv4gOcDGzikJHYGxNXtYqQKkPUgkG+z4ASxogQUnp1zmD2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Jul 29, 2016 at 07:44:25AM -0700, Dan Williams wrote:
> On Thu, Jul 28, 2016 at 7:21 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> > On Thu, Jul 28, 2016 at 10:10:33AM +0200, Jan Kara wrote:
> >> On Thu 28-07-16 08:19:49, Dave Chinner wrote:
> [..]
> >> So DAX doesn't need flushing to maintain consistent view of the data but it
> >> does need flushing to make sure fsync(2) results in data written via mmap
> >> to reach persistent storage.
> >
> > I thought this all changed with the removal of the pcommit
> > instruction and wmb_pmem() going away.  Isn't it now a platform
> > requirement that dirty cache lines over persistent memory ranges
> > are either guaranteed to be flushed to persistent storage on power
> > fail or when required by REQ_FLUSH?
> 
> No, nothing automates cache flushing.  The path of a write is:
> 
> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
> 
> The ADR mechanism and the wpq-flush facility flush data through the
> imc (integrated memory controller) to media.  dax_do_io() gets writes
> to the imc, but we still need a posted-write-buffer flush mechanism to
> guarantee data makes it out to media.

So what you are saying is that on an ADR machine, we have these
domains w.r.t. power fail:

cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media

|-------------volatile-------------------|-----persistent--------------|

because anything that gets to the IMC is guaranteed to be flushed to
stable media on power fail.

But on a posted-write-buffer system, we have this:

cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media

|-------------volatile-------------------------------------------|--persistent--|

IOWs, only things already posted to the media via REQ_FLUSH are
considered stable on persistent media.  What happens in this case
when power fails during a media update? Incomplete writes?
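
For reference, the two diagrams above can be restated as a small C
sketch. The enum and function names are made up for illustration and
do not correspond to any kernel interface; the boundary positions
simply follow the diagrams.

```c
#include <assert.h>
#include <stdbool.h>

/* Stages of the write path, in the order drawn in the diagrams. */
enum write_stage {
	STAGE_CPU_CACHE,
	STAGE_CPU_WRITE_BUFFER,
	STAGE_BUS,
	STAGE_IMC,
	STAGE_IMC_WRITE_BUFFER,
	STAGE_MEDIA,
};

/* Hypothetical platform types; illustrative names only. */
enum platform {
	PLATFORM_ADR,
	PLATFORM_POSTED_WRITE_BUFFER,
};

/* First stage at which data is expected to survive a power fail. */
static enum write_stage persistence_boundary(enum platform p)
{
	return p == PLATFORM_ADR ? STAGE_IMC : STAGE_MEDIA;
}

/* True if data that has reached stage 's' is in the persistent domain. */
static bool survives_power_fail(enum platform p, enum write_stage s)
{
	assert(s <= STAGE_MEDIA);
	return s >= persistence_boundary(p);
}
```

With this model, data sitting in the imc-write-buffer is persistent on
the ADR platform and volatile on the posted-write-buffer platform.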

> > Or have we somehow ended up with the fucked up situation where
> > dax_do_io() writes are (effectively) immediately persistent and
> > untracked by internal infrastructure, whilst mmap() writes
> > require internal dirty tracking and fsync() to flush caches via
> > writeback?
> 
> dax_do_io() writes are not immediately persistent.  They bypass the
> cpu-cache and cpu-write-buffer and are ready to be flushed to media
> by REQ_FLUSH or power-fail on an ADR system.

IOWs, on an ADR system a write is /effectively/ immediately persistent
because if power fails ADR guarantees it will be flushed to stable
media, while on a posted write system it is volatile and will be
lost. Right?

If so, that's even worse than just having mmap/write behave
differently - now writes will behave differently depending on the
specific hardware installed. I think this makes it even more
important for the DAX code to hide this behaviour from the
filesystems by treating everything as volatile.

If we track the dirty blocks from write in the radix tree like we
do for mmap, then we can just use a normal memcpy() in dax_do_io(),
getting rid of the slow cache bypass that is currently run. Radix
tree updates are much less expensive than a slow memcpy of large
amounts of data, and fsync can then take care of persistence, just
like we do for mmap.
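
A rough user-space sketch of that idea, under loud assumptions: a
plain bool array stands in for the radix tree dirty tags, flush_block()
is a counter standing in for whatever cache flush the platform needs,
and every name here (pmem_region, pmem_write, pmem_mmap_dirty,
pmem_fsync) is hypothetical rather than an existing kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PMEM_SIZE  ((size_t)64 * 1024)	/* size of the simulated region */
#define BLOCK_SIZE ((size_t)4096)	/* dirty tracking granularity */
#define NR_BLOCKS  (PMEM_SIZE / BLOCK_SIZE)

struct pmem_region {
	unsigned char data[PMEM_SIZE];
	bool dirty[NR_BLOCKS];	/* stand-in for radix tree dirty tags */
	unsigned long flushed;	/* blocks flushed so far (illustration) */
};

/* Stand-in for flushing one block's cache lines to the persistent domain. */
static void flush_block(struct pmem_region *r, size_t blk)
{
	(void)blk;
	r->flushed++;
}

/* Tag every block touched by [off, off + len) as dirty. */
static void mark_dirty(struct pmem_region *r, size_t off, size_t len)
{
	if (len == 0)
		return;
	assert(off <= PMEM_SIZE && len <= PMEM_SIZE - off);
	for (size_t b = off / BLOCK_SIZE; b <= (off + len - 1) / BLOCK_SIZE; b++)
		r->dirty[b] = true;
}

/* write() path: ordinary cached memcpy, then dirty tracking. */
static void pmem_write(struct pmem_region *r, size_t off,
		       const void *buf, size_t len)
{
	assert(off <= PMEM_SIZE && len <= PMEM_SIZE - off);
	memcpy(r->data + off, buf, len);
	mark_dirty(r, off, len);
}

/* mmap path: the store went through the mapping; only record the range. */
static void pmem_mmap_dirty(struct pmem_region *r, size_t off, size_t len)
{
	mark_dirty(r, off, len);
}

/* fsync(): flush and clean every dirty block; returns how many were flushed. */
static size_t pmem_fsync(struct pmem_region *r)
{
	size_t n = 0;

	for (size_t b = 0; b < NR_BLOCKS; b++) {
		if (!r->dirty[b])
			continue;
		flush_block(r, b);
		r->dirty[b] = false;
		n++;
	}
	return n;
}
```

Both modification paths feed the same dirty set, so pmem_fsync() does
the same work no matter how the data was written.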

We should just make the design assumption that all persistent memory
is volatile, track where we dirty it in all paths, and use the
fastest volatile memcpy primitives available to us in the IO path.
We'll end up with a faster fastpath than if we use CPU cache bypass
copies, dax_do_io() and mmap will be coherent and synchronised, and
fsync() will have the same requirements and overhead regardless of
the way the application modifies the pmem or the hardware platform
used to implement the pmem.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
