From: Dave Chinner <david@fromorbit.com>
To: "Boylston, Brian" <brian.boylston@hpe.com>
Cc: "Kani, Toshimitsu" <toshi.kani@hpe.com>,
"jack@suse.cz" <jack@suse.cz>,
"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
"xfs@oss.sgi.com" <xfs@oss.sgi.com>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: Subtle races between DAX mmap fault and write path
Date: Tue, 9 Aug 2016 09:12:25 +1000
Message-ID: <20160808231225.GD19025@dastard>
In-Reply-To: <CS1PR84MB0119314ACA9B4823C0FE33318E180@CS1PR84MB0119.NAMPRD84.PROD.OUTLOOK.COM>

On Fri, Aug 05, 2016 at 07:58:33PM +0000, Boylston, Brian wrote:
> Dave Chinner wrote on 2016-08-05:
> > [ cut to just the important points ]
> > On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> >> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> >>> If I drop the fsync from the
> >>> buffered IO path, bandwidth remains the same but runtime drops to
> >>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
> >>> while doing more work.
> >>
> >> I do not think the test results are relevant on this point because both
> >> buffered and dax write() paths use uncached copy to avoid clflush.  The
> >> buffered path uses cached copy to the page cache and then uses uncached
> >> copy to PMEM via writeback.  Therefore, the buffered IO path also
> >> benefits from using uncached copy to avoid clflush.
> >
> > Except that I tested without the writeback path for buffered IO, so
> > there was a direct comparison for single cached copy vs single
> > uncached copy.
> >
> > The undeniable fact is that a write() with a single cached copy, with
> > all the overhead of dirty page tracking, is /faster/ than a much
> > shorter, simpler IO path that uses an uncached copy. That's what the
> > numbers say....
> >
> >> Cached copy (rep movq) is slightly faster than uncached copy,
> >
> > Not according to Boaz - he claims that uncached is 20% faster than
> > cached. How about you two get together, do some benchmarking and get
> > your story straight, eh?
> >
> >> and should be used for writing to the page cache.  For writing to PMEM,
> >> however, the additional clflush can be expensive, and allocating
> >> cachelines for PMEM leads to evicting the application's cachelines.
> >
> > I keep hearing people tell me why cached copies are slower, but
> > no-one is providing numbers to back up their statements. The only
> > numbers we have are the ones I've published showing cached copies w/
> > full dirty tracking are faster than uncached copies w/o dirty tracking.
> >
> > Show me the numbers that back up your statements, then I'll listen
> > to you.
>
> Here are some numbers for a particular scenario, and the code is below.
>
> Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
> (1M total memcpy()s). For the cached+clflush case, the flushes are done
> every 4MiB (which seems slightly faster than flushing every 16KiB):
>
>                     NUMA local    NUMA remote
>   Cached+clflush       13.5           37.1
>   movnt                 1.0            1.3

So let's put that in memory bandwidth terms. You wrote 16GB to the
NVDIMM. That means:

                      NUMA local    NUMA remote
    Cached+clflush     1.2 GB/s      0.43 GB/s
    movnt             16.0 GB/s      12.3 GB/s
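
(The conversion is just bytes over time: 16 KiB x 1M copies = 16 GiB
written per run, so e.g. 16 GB / 13.5 s ~= 1.2 GB/s for the local
cached+clflush case and 16 GB / 1.0 s = 16 GB/s for local movnt.)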

That smells wrong. The DAX code (using movnt) is not 1-2 orders of
magnitude faster than a page cache copy, so I don't believe your
benchmark reflects what I'm proposing.

What I think you're getting wrong is that we are not doing a clflush
after every 16k write when we use the page cache, nor will we do
that if we use cached copies, dirty tracking and clflush on fsync().

IOWs, the correct equivalent "cached + clflush" loop to a volatile
copy with dirty tracking + fsync would be:

        dstp = dst;
        while (--nloops) {
                memcpy(dstp, src, src_sz);      // pwrite();
                dstp += src_sz;
        }
        pmem_persist(dst, dstsz);               // fsync();
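
For concreteness, the same loop as a compilable sketch (this assumes
libpmem's pmem_persist() and a pre-mapped pmem destination; the
function and parameter names here are illustrative, not from the
thread):

        #include <string.h>
        #include <libpmem.h>

        /*
         * Cached copies into a pmem mapping, with all cache flushing
         * deferred to a single persist point - the moral equivalent
         * of many pwrite()s followed by one fsync().
         */
        static void
        copy_then_persist(char *dst, size_t dst_sz, const char *src,
                          size_t src_sz, int nloops)
        {
                char *dstp = dst;

                while (nloops--) {
                        memcpy(dstp, src, src_sz);      /* pwrite() */
                        dstp += src_sz;
                }
                pmem_persist(dst, dst_sz);              /* fsync() */
        }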

i.e. the cache flushes occur only at the user-defined
synchronisation point, not on every syscall.

Yes, if you want to make your copy slow and safe, use O_SYNC to
trigger clflush on every write() call - that's what we do for
existing storage and the mechanisms are already there; we just need
the dirty tracking to optimise it.
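
In userspace terms, the two models look like this (a sketch only; the
file path and names are illustrative, and error handling is omitted):

        #include <fcntl.h>
        #include <unistd.h>

        /*
         * per_write_sync: O_SYNC makes every write() a data integrity
         * synchronisation point; otherwise caches need flushing only
         * at the single application-chosen fsync().
         */
        static void
        write_loop(const char *buf, size_t sz, int nloops,
                   int per_write_sync)
        {
                int flags = O_WRONLY | (per_write_sync ? O_SYNC : 0);
                int fd = open("/mnt/pmem/data", flags);

                for (int i = 0; i < nloops; i++)
                        write(fd, buf, sz);
                if (!per_write_sync)
                        fsync(fd);      /* one app-chosen sync point */
                close(fd);
        }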

Put simply: we should only care about cache flush synchronisation at
user-defined data integrity synchronisation points. That's the IO
model the kernel has always exposed to users, and pmem storage is no
different.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com