Re: [PATCH 5/5] block: enable dax for raw block devices


From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"jmoyer@redhat.com" <jmoyer@redhat.com>,
	"hch@lst.de" <hch@lst.de>, "axboe@fb.com" <axboe@fb.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	"willy@linux.intel.com" <willy@linux.intel.com>,
	"ross.zwisler@linux.intel.com" <ross.zwisler@linux.intel.com>
Subject: Re: [PATCH 5/5] block: enable dax for raw block devices
Date: Mon, 26 Oct 2015 17:23:19 +1100	[thread overview]
Message-ID: <20151026062319.GJ19199@dastard> (raw)
In-Reply-To: <CAPcyv4hjGYYPRyPjZc3CymmnSObB7mULRbeMZFjnKdoCD_m7pw@mail.gmail.com>

On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
> On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
> >> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
> >> filesystems since there's nothing which writeprotects pages that are
> >> writeably mapped. In normal path, page writeback does this but that doesn't
> >> happen for DAX. I remember we once talked about this but it got lost.
> >> We need something like walk all filesystem inodes during fs freeze and
> >> writeprotect all pages that are mapped. But that's going to be slow...
> >
> > fsync() has the same problem - we have no record of the pages that
> > need to be committed and then write protected when fsync() is called
> > after write()...
> 
> I know Ross is still working on that implementation.  However, I had a
> thought on the flight to ksummit that maybe we shouldn't worry about
> tracking dirty state on a per-page basis.  For small / frequent
> synchronizations an application really should be using the nvml
> library [1] to issue cache flushes and pcommit from userspace on a
> per-cacheline basis.  That leaves unmodified apps that want to be
> correct in the presence of dax mappings.  Two things we can do to
> mitigate that case:
> 
> 1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
> flag.  Applications shouldn't silently become incorrect simply because
> the fs is mounted with -o dax.  If an app doesn't understand DAX
> mappings it should get page-cache semantics.  This also protects apps
> that are not expecting DAX semantics on raw block device mappings.

Which is the complete opposite of what we are trying to achieve with
DAX. i.e. that existing applications "just work" with DAX without
modification. So this is a non-starter.

Also, DAX access isn't a property of mmap - it's a property
of the inode. We cannot do DAX access via mmap while mixing page
cache based access through file descriptor based interfaces. This
is why I'm adding an inode attribute (on disk) to enable per-file DAX
capabilities - either everything is via the DAX paths, or nothing
is.

> 2/ Even if we get a new flag that lets the kernel know the app
> understands DAX mappings, we shouldn't leave fsync broken.  Can we
> instead get by with a simple / big hammer solution?  I.e.

Because we don't physically have to write back data the problem is
both simpler and more complex. The simplest solution is for the
underlying block device to implement blkdev_issue_flush() correctly.

i.e. if blkdev_issue_flush() behaves according to its required
semantics - that all volatile cached data is flushed to stable
storage - then fsync-on-DAX will work appropriately. As it is, this is
needed for journal based filesystems to work correctly, as they are
assuming that their journal writes are being treated correctly as
REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
ordering is maintained....

So, to begin with, this problem needs to be solved at the block
device level. That's the simple, brute-force, big hammer solution to
the problem, and it requires no changes at the filesystem level at
all.

However, to avoid having to flush the entire block device range on
fsync we need a much more complex solution that tracks the dirty
ranges of the file and hence what needs committing when fsync is
run....
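
As a rough illustration of what "tracking the dirty ranges" could mean
at its very coarsest, here is a toy userspace model that keeps a single
merged dirty extent per file. It is only a sketch under stated
assumptions: struct dirty_range and the dirty_range_*() helpers are
hypothetical names invented for this example, not kernel interfaces,
and a real implementation would need finer granularity and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file record: one merged dirty byte extent [start, end). */
struct dirty_range {
	uint64_t start;
	uint64_t end;	/* exclusive; start == end means nothing is dirty */
};

static void dirty_range_init(struct dirty_range *dr)
{
	dr->start = 0;
	dr->end = 0;
}

/* Record that bytes [off, off + len) were modified through a mapping. */
static void dirty_range_mark(struct dirty_range *dr, uint64_t off, uint64_t len)
{
	/* Reject empty ranges and ranges that wrap around. */
	assert(len > 0 && off + len > off);

	if (dr->start == dr->end) {
		/* First dirty range since the last sync. */
		dr->start = off;
		dr->end = off + len;
		return;
	}
	/* Otherwise grow the single extent to cover the new range. */
	if (off < dr->start)
		dr->start = off;
	if (off + len > dr->end)
		dr->end = off + len;
}

/*
 * At sync time: report the extent that needs committing and reset the
 * record to clean. Returns false if nothing was dirtied.
 */
static bool dirty_range_take(struct dirty_range *dr, uint64_t *off, uint64_t *len)
{
	if (dr->start == dr->end)
		return false;
	*off = dr->start;
	*len = dr->end - dr->start;
	dirty_range_init(dr);
	return true;
}
```

With a record like this, a sync only has to commit the reported extent
rather than the whole device. The cost of the single-extent shortcut is
that two small writes far apart dirty everything in between.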

> Disruptive, yes, but if an app cares about efficient persistent memory
> synchronization fsync is already the wrong api.

I don't really care about efficiency right now - correctness comes
first. Fundamentally, the app should not care whether it is writing to
persistent memory or spinning rust - the filesystem needs to
provide the application with exactly the same integrity guarantees
regardless of the underlying storage.
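
To make that concrete, here is a minimal sketch of what an unmodified
application typically does: store through a shared mapping, then ask
for durability with msync() and fsync(). The helper name
store_and_sync() and the MAP_LEN constant are made up for this example;
the system calls are standard POSIX. Nothing in the code knows or
cares what kind of storage sits underneath.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN 4096	/* hypothetical fixed mapping size for the example */

/*
 * Write msg at the start of the file via a shared mapping and request
 * that it reach stable storage. Returns 0 on success, -1 on failure.
 */
static int store_and_sync(const char *path, const char *msg)
{
	size_t n;
	int ret = -1;
	char *p;
	int fd;

	assert(path && msg);
	n = strlen(msg);
	if (n > MAP_LEN)
		return -1;

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	/* Make sure the file is large enough to back the mapping. */
	if (ftruncate(fd, MAP_LEN) < 0)
		goto out_close;

	p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out_close;

	/* Plain stores into the mapping; no storage-specific calls. */
	memcpy(p, msg, n);

	/* The application's only durability request. */
	if (msync(p, MAP_LEN, MS_SYNC) == 0 && fsync(fd) == 0)
		ret = 0;

	munmap(p, MAP_LEN);
out_close:
	close(fd);
	return ret;
}
```

A return of 0 here is the application's entire view of the integrity
guarantee; whatever flushing is needed to honour it has to happen
below this interface.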

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
