From: Dan Williams <dan.j.williams@intel.com>
To: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, linux-nvdimm <linux-nvdimm@lists.01.org>,
	Linux API <linux-api@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Jeff Moyer <jmoyer@redhat.com>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
Date: Sat, 17 Jun 2017 14:52:38 -0700
Message-ID: <CAPcyv4j4UEegViDJcLZjVv5AFGC18-DcvHFnhZatB0hH3BY85g@mail.gmail.com>
In-Reply-To: <CALCETrU1Hg=q4cdQDex--3nVBfwRC1o=9pC6Ss77Z8Lxg7ZJLg@mail.gmail.com>

On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> To date, the full promise of byte-addressable access to persistent
>> memory has only been half realized via the filesystem-dax interface. The
>> current filesystem-dax mechanism allows an application to consume (read)
>> data from persistent storage at byte-size granularity, bypassing the
>> full page reads required by traditional storage devices.
>>
>> Now, for writes, applications still need to contend with
>> page-granularity dirtying and flushing semantics as well as filesystem
>> coordination for metadata updates after any mmap write. The current
>> situation precludes use cases that leverage byte-granularity / in-place
>> updates to persistent media.
>>
>> To get around this limitation there are some specialized applications
>> that are using the device-dax interface to bypass the overhead and
>> data-safety problems of the current filesystem-dax mmap-write path.
>> QEMU-KVM is forced to use device-dax to safely pass through persistent
>> memory to a guest [1]. Some specialized databases are using device-dax
>> for byte-granularity writes. Outside of those cases, device-dax is
>> difficult for general purpose persistent memory applications to consume.
>> There is demand for access to pmem without needing to contend with
>> special device configuration and other device-dax limitations.
>>
>> The 'daxfile' interface satisfies this demand and realizes one of Dave
>> Chinner's ideas for allowing pmem applications to safely bypass
>> fsync/msync requirements. The idea is to make the file immutable with
>> respect to the offset-to-block mappings for every extent in the file
>> [2]. It turns out that filesystems already need to make this guarantee
>> today. This property is needed for files marked as swap files.
>>
>> The new daxctl() syscall manages setting a file into 'static-dax' mode
>> whereby it arranges for the file to be treated as a swapfile as far as
>> the filesystem is concerned, but not registered with the core-mm as
>> swapfile space. A file in this mode is then safe to be mapped and
>> written without the requirement to fsync/msync the writes.  The cpu
>> cache management for flushing data to persistence can be handled
>> completely in userspace.
>
> Can you remind those of us who haven't played with DAX in a while what
> the problem is with mmapping a DAX file without this patchset?  If
> there's some bookkeeping needed to make sure that the filesystem will
> invalidate all the mappings if it decides to move the file, maybe that
> should be the default rather than needing a new syscall.

The bookkeeping to invalidate mappings when the filesystem moves a
block is already there.

Without this patchset an application needs to call fsync/msync after
any write to a DAX mapping; otherwise there is no guarantee that the
filesystem has written the metadata needed to find the updated block
after a crash or power loss. Even if the sync operation is reduced to
a minimal cmpxchg in userspace that checks whether the filesystem
metadata is dirty, that mechanism doesn't translate to a virtualized
environment, because requiring guests to trigger host fsync()s is not
feasible. It's a half-step solution when you can instead just ask the
filesystem to never move blocks, as Dave proposed many months back.
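
For anyone who wants the status quo spelled out, here is a minimal
sketch of the pattern applications are stuck with today: store through
a shared mapping, then msync() before treating the data as durable. It
uses only standard POSIX calls, so it runs on any filesystem, but only
on a DAX mount does the mapping go straight to pmem. The helper name
store_and_sync() is made up for illustration and is not part of the
patchset.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only, not from the patchset):
 * copy `len` bytes to offset 0 of `path` through a MAP_SHARED mapping
 * and msync() them. Returns 0 on success, -1 on failure.
 */
int store_and_sync(const char *path, const void *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /*
     * Sketch only: this sets the file size to exactly `len`. The
     * blocks are allocated on the first write fault, which is one of
     * the points where filesystem metadata becomes dirty.
     */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The store itself: on a DAX mapping this targets pmem directly. */
    memcpy(map, buf, len);

    /*
     * Without this call nothing guarantees that the metadata needed
     * to find the block after a crash has been written.
     */
    int rc = msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

A call such as store_and_sync("/mnt/pmem/file", "hello", 5), with the
path being a placeholder, pays for that msync() on every update. That
per-write round trip into the filesystem is what a file with immutable
block mappings would let the application skip.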

We stepped back from that proposal when it looked like a significant
amount of per-filesystem work to introduce the capability and it was
not clear that application developers would tolerate the side effects
of this 'immutable' semantic. However, the implementation is dead
simple since ext4 and xfs already need to make
block-allocation-immutable semantics available for swapfiles. We also
have application developers telling us they are ok with the semantics,
especially because it catches Linux up to other operating environments
that are already on board with allowing this type of access to pmem
through a filesystem. This patchset gives pmem application developers
what they want without any additional burden on filesystem
implementations.
