From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>, Jens Axboe <axboe@fb.com>,
Jan Kara <jack@suse.cz>,
"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Jeff Moyer <jmoyer@redhat.com>, Jan Kara <jack@suse.com>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
Date: Tue, 3 Nov 2015 10:57:57 -0700
Message-ID: <20151103175757.GA23366@linux.intel.com>
In-Reply-To: <CAPcyv4iwiTMMWGE63KX_tzrH1_pEpPxzAvRNgpaDEXAOhXU1BA@mail.gmail.com>

On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> >> > The zeroing (and the data, for that matter) doesn't need to be
> >> > committed to persistent store until the allocation is written and
> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> >> > write, so it makes sense to deploy the big hammer and delay the
> >> > blocking CPU cache flushes until the last possible moment in cases
> >> > like this.
> >>
> >> In pmem terms that would be a non-temporal memset plus a delayed
> >> wmb_pmem at REQ_FLUSH time. Better to write around the cache than
> >> loop over the dirty-data issuing flushes after the fact. We'll bump
> >> the priority of the non-temporal memset implementation.
> >
> > Why is it better to do two synchronous physical writes to memory
> > within a couple of microseconds of CPU time rather than writing them
> > through the cache and, in most cases, only doing one physical write
> > to memory in a separate context that expects to wait for a flush
> > to complete?
>
> With a switch to non-temporal writes they wouldn't be synchronous,
> although it's doubtful that the subsequent writes after zeroing would
> also hit the store buffer.
>
> If we had a method to flush by physical-cache-way rather than a
> virtual address then it would indeed be better to save up for one
> final flush, but when we need to resort to looping through all the
> virtual addresses that might have touched it gets expensive.

I agree with the idea that we should avoid the "big hammer" flushing in
response to REQ_FLUSH.  Here are the steps that are needed to make sure that
something is durable on media with PMEM/DAX:

1) Write, either with non-temporal stores or with stores that use the
   processor cache

2) If you wrote using the processor cache, flush or write back the processor
   cache

3) wmb_pmem(), synchronizing all non-temporal writes and flushes durably to
   media.
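
To make the ordering concrete, here is a toy userspace model of the three
steps.  It is only a sketch under loud assumptions: struct toy_pmem and every
toy_* function are made up for illustration and are not kernel interfaces, one
byte stands in for a whole cache line, and toy_wmb_pmem() merely plays the role
that wmb_pmem() has in step 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_LINES 8	/* number of modelled cache lines (arbitrary) */

/* Toy model, NOT kernel code: one byte stands in for each cache line. */
struct toy_pmem {
	unsigned char media[TOY_LINES];   /* what would survive power loss */
	unsigned char cache[TOY_LINES];   /* contents of the CPU cache */
	bool dirty[TOY_LINES];            /* line is modified in cache only */
	unsigned char pending[TOY_LINES]; /* posted writes not yet on media */
	bool queued[TOY_LINES];           /* line has a posted write */
};

/* Step 1, cached variant: the data lands in the CPU cache only. */
static void toy_store_cached(struct toy_pmem *p, size_t line, unsigned char v)
{
	assert(line < TOY_LINES);
	p->cache[line] = v;
	p->dirty[line] = true;
}

/* Step 1, non-temporal variant: bypass the cache, post the write. */
static void toy_store_nt(struct toy_pmem *p, size_t line, unsigned char v)
{
	assert(line < TOY_LINES);
	p->pending[line] = v;
	p->queued[line] = true;
	p->dirty[line] = false;	/* simplification: drop any cached copy */
}

/* Step 2: flush one dirty line out of the cache; it is now merely posted. */
static void toy_flush_line(struct toy_pmem *p, size_t line)
{
	assert(line < TOY_LINES);
	if (p->dirty[line]) {
		p->pending[line] = p->cache[line];
		p->queued[line] = true;
		p->dirty[line] = false;
	}
}

/*
 * Step 3: stand-in for wmb_pmem().  The loop is only the model's
 * bookkeeping; the real thing is a barrier, not a walk over memory.
 */
static void toy_wmb_pmem(struct toy_pmem *p)
{
	for (size_t i = 0; i < TOY_LINES; i++) {
		if (p->queued[i]) {
			p->media[i] = p->pending[i];
			p->queued[i] = false;
		}
	}
}

/* Returns how many of the three expectations hold (3 means all of them). */
static int toy_demo(void)
{
	struct toy_pmem p;
	int ok = 0;

	memset(&p, 0, sizeof(p));

	toy_store_nt(&p, 0, 0xAA);	/* step 1, non-temporal */
	toy_store_cached(&p, 1, 0xBB);	/* step 1, cached, step 2 skipped */
	toy_wmb_pmem(&p);		/* step 3 */
	ok += (p.media[0] == 0xAA);	/* steps 1 and 3 suffice for NT */
	ok += (p.media[1] == 0x00);	/* cached store is not durable yet */

	toy_flush_line(&p, 1);		/* step 2 */
	toy_wmb_pmem(&p);		/* step 3 */
	ok += (p.media[1] == 0xBB);	/* now the cached store is durable */

	return ok;
}
```

In this model toy_demo() returns 3: the non-temporal store is durable after
the barrier alone, while the cached store only reaches media once it has been
flushed and a barrier has followed.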

The PMEM driver does all of its I/O using steps 1 and 3 with non-temporal
stores, while mmaps that go to userspace can use cached writes, so on
fsync/msync we do a bunch of flushes for step 2.  In either case I think we
should have the PMEM driver just do step 3, the wmb_pmem(), in response to
REQ_FLUSH.  This allows the zeroing code to just do non-temporal writes of
zeros, the DAX fsync/msync code to just do flushes (which is what my patch set
already does), and just leave the wmb_pmem() to the PMEM driver at REQ_FLUSH
time.
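
As a rough userspace illustration of "non-temporal writes of zeros" with the
barrier left to someone else, something along these lines could work.  This is
a sketch, not the kernel's clear_pmem(): nt_zero(), nt_barrier() and
nt_zero_selfcheck() are hypothetical names, the streaming-store path is only
taken on x86-64, and _mm_sfence() only orders the streaming stores; it is a
stand-in for, not an equivalent of, wmb_pmem().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#endif

/*
 * Hypothetical helper, not a kernel API: zero a buffer, bypassing the CPU
 * cache for the 8-byte-aligned middle where the platform allows it.  No
 * barrier is issued here; that is deliberately left to the caller.
 */
static void nt_zero(void *dst, size_t len)
{
	unsigned char *p = dst;

	assert(dst != NULL || len == 0);
#if defined(__x86_64__)
	/* Unaligned head: ordinary cached byte stores. */
	while (len > 0 && ((uintptr_t)p & 7) != 0) {
		*p++ = 0;
		len--;
	}
	/* Aligned middle: non-temporal 8-byte stores. */
	while (len >= 8) {
		_mm_stream_si64((long long *)p, 0);
		p += 8;
		len -= 8;
	}
#endif
	/*
	 * Tail (or the whole buffer on other architectures): cached stores.
	 * On real pmem these bytes would still need a cache flush (step 2);
	 * this sketch does not do that.
	 */
	memset(p, 0, len);
}

/* Hypothetical stand-in for the deferred barrier. */
static void nt_barrier(void)
{
#if defined(__x86_64__)
	_mm_sfence();
#endif
}

/* Returns 1 if exactly bytes 3..52 of a 64-byte buffer were zeroed. */
static int nt_zero_selfcheck(void)
{
	unsigned char buf[64];

	memset(buf, 0xFF, sizeof(buf));
	nt_zero(buf + 3, 50);
	nt_barrier();

	for (size_t i = 0; i < sizeof(buf); i++) {
		unsigned char want = (i >= 3 && i < 53) ? 0x00 : 0xFF;

		if (buf[i] != want)
			return 0;
	}
	return 1;
}
```

The point of the split is that nt_zero() can be called many times with a
single nt_barrier() afterwards, which mirrors leaving the one barrier to
REQ_FLUSH time.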

This makes the burden of REQ_FLUSH bearable for the PMEM driver, allowing us
to avoid looping through potentially terabytes of PMEM on each REQ_FLUSH bio.

This just means that the layers above the PMEM code either need to use
non-temporal writes for their I/Os, or do flushing, which I don't think is too
onerous.
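
For a sense of scale, the number of per-line flushes a full walk would need
can be worked out directly.  The helper below is hypothetical and assumes a
64-byte cache line, which is typical on x86 but is not stated anywhere in this
thread.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u	/* assumption: typical x86 line size */

/* Hypothetical helper: cache-line flushes needed to cover `bytes`. */
static uint64_t flushes_needed(uint64_t bytes)
{
	/* Guard against overflow in the round-up below. */
	assert(bytes <= UINT64_MAX - (CACHE_LINE_BYTES - 1));
	return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

Under that assumption, flushes_needed(1ULL << 40) gives 2^34 =
17,179,869,184 flush operations for 1 TiB, all of which a flush-everything
REQ_FLUSH would have to issue on every bio.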