From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>,
"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
Brian Foster <bfoster@redhat.com>,
xfs@oss.sgi.com, linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Ross Zwisler <ross.zwisler@linux.intel.com>
Subject: Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX
Date: Tue, 3 Nov 2015 16:04:13 +1100 [thread overview]
Message-ID: <20151103050413.GB19199@dastard> (raw)
In-Reply-To: <CAPcyv4i_D6TuV8B6WF-5JoBdgh9FZbeBim8=s45RnQfhWAVpYg@mail.gmail.com>
On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> > [add people to the cc list]
> >
> > On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote:
> >> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote:
> >> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote:
> >> > > Unless there is some
> >> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically
> >> > > causes a clear_pmem() over every page sized chunk of the target I/O
> >> > > range for which we already have the data.
> >> >
> >> > I don't follow - this only zeros blocks when we do allocation of new
> >> > blocks or overwrite unwritten extents, not on blocks which we
> >> > already have written data extents allocated for...
> >> >
> >>
> >> Why are we assuming that block zeroing is more efficient than unwritten
> >> extents for DAX/dio? I haven't played with pmem enough to know for sure
> >> one way or another (or if hw support is imminent), but I'd expect the
> >> latter to be more efficient in general without any kind of hardware
> >> support.
> >>
> >> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS
> >> on a ramdisk mounted with '-o dax':
> >>
> >> - Before this series:
> >>
> >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> >> wrote 8589934592/8589934592 bytes at offset 0
> >> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec)
> >>
> >> - After this series:
> >>
> >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file
> >> wrote 8589934592/8589934592 bytes at offset 0
> >> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec)
> >
> > That looks wrong. Much, much slower than it should be just zeroing
> > pages and then writing to them again while cache hot.
> >
> > Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this
> > loop spending most of the CPU time:
> >
> >        │      ↓ jbe    ea
> >        │ de:    clflush %ds:(%rax)
> > 84.67  │        add    %rsi,%rax
> >        │        cmp    %rax,%rdx
> >        │      ↑ ja     de
> >        │ ea:    add    %r13,-0x38(%rbp)
> >        │        sub    %r12,%r14
> >        │        sub    %r12,-0x40(%rbp)
> >
> > That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU
> > cache flushes after each memset.
>
> Ideally this would be non-temporal and skip the second flush loop
> altogether. Outside of that another problem is that this cpu does not
> support the clwb instruction and is instead using the serializing and
> invalidating clflush instruction.

Sure, it can be optimised to improve behaviour, and other hardware
will be more performant, but we'll still be doing synchronous cache
flushing operations here and so the fundamental problem still
exists.
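To make the cost concrete, here's a user-space model of that hot loop
(illustrative only, not the kernel code: _mm_clflush and _mm_sfence
stand in for the clflush instructions in the profile and the
wmb_pmem() fence):

```c
#include <stdint.h>
#include <string.h>
#include <immintrin.h>

#define CACHELINE 64UL

/* User-space model of the dax_clear_blocks() hot path: zero the range,
 * then flush every cache line it covers, then fence.  The flush loop
 * is the ~85% hotspot in the profile above, and because clflush also
 * invalidates the line, the data is cache-cold again by the time the
 * subsequent pwrite() overwrites it. */
static void clear_then_flush(void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	memset(addr, 0, len);

	/* __arch_wb_cache_pmem() equivalent: one clflush per line */
	for (; p < end; p += CACHELINE)
		_mm_clflush((void *)p);

	/* wmb_pmem() equivalent: order the flushes */
	_mm_sfence();
}
```

The point is that the flushing cost scales with the size of the range
being zeroed, and it is paid synchronously in the write path.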
> > This comes back to the comments I made w.r.t. the pmem driver
> > implementation doing synchronous IO by immediately forcing CPU cache
> > flushes and barriers. it's obviously correct, but it looks like
> > there's going to be a major performance penalty associated with it.
> > This is why I recently suggested that a pmem driver that doesn't do
> > CPU cache writeback during IO but does it on REQ_FLUSH is an
> > architecture we'll likely have to support.
> >
>
> The only thing we can realistically delay is wmb_pmem() i.e. the final
> sync waiting for data that has *left* the cpu cache. Unless/until we
> get a architecturally guaranteed method to write-back the entire
> cache, or flush the cache by physical-cache-way we're stuck with
> either non-temporal cycles or looping on potentially huge virtual
> address ranges.
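In user-space terms, the non-temporal variant being described is
something like this (a sketch only: movnti bypasses the cache, so the
second flush loop disappears and only the final fence remains):

```c
#include <stddef.h>
#include <immintrin.h>

/* Sketch of non-temporal zeroing: movnti stores go around the CPU
 * cache, so no per-line flush loop is needed afterwards -- only an
 * sfence to drain the write-combining buffers.  Assumes the range is
 * 8-byte aligned and an 8-byte multiple. */
static void clear_nontemporal(void *addr, size_t len)
{
	long long *p = addr;
	long long *end = (long long *)((char *)addr + len);

	for (; p < end; p++)
		_mm_stream_si64(p, 0);	/* movnti: cache-bypassing store */

	_mm_sfence();			/* make the stores globally visible */
}
```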

I'm missing something: why won't flushing the address range returned
by bdev_direct_access() during a fsync operation work? i.e. we're
working with exactly the same address as dax_clear_blocks() and
dax_do_io() use, so why can't we look up that address and flush it
from fsync?
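In user-space terms, the split I'm suggesting looks something like the
following (all names are illustrative, not kernel APIs; the point is
only that the data copy and the persistence flush are separated):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <immintrin.h>

/* Toy model of a pmem driver that defers CPU cache writeback: the
 * write path only copies and remembers what was dirtied; the flush
 * happens later, on the REQ_FLUSH/fsync path, over the same mapped
 * range that bdev_direct_access() would hand back. */
struct dirty_range {
	void  *addr;
	size_t len;
};

/* write side: copy the data, do no cache flushing at all */
static void pmem_write(struct dirty_range *d, void *dst,
		       const void *src, size_t len)
{
	memcpy(dst, src, len);
	d->addr = dst;
	d->len = len;
}

/* REQ_FLUSH/fsync side: flush the recorded range, then fence */
static void pmem_flush(const struct dirty_range *d)
{
	uintptr_t p = (uintptr_t)d->addr & ~(uintptr_t)63;
	uintptr_t end = (uintptr_t)d->addr + d->len;

	for (; p < end; p += 64)
		_mm_clflush((void *)p);
	_mm_sfence();
}
```

A real implementation would track many dirty ranges, but the shape of
the architecture is the same: writes stay cache-hot, and the flush
cost is paid once, at the data integrity point.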
> > This, however, is not really a problem for the filesystem - it's
> > a pmem driver architecture problem. ;)
> >
>
> It's a platform problem. Let's see how this looks when not using
> clflush instructions.

Yes, clflush vs clwb is a platform problem.

However, the architectural problem I've referred to is that we've
designed the DAX infrastructure around a CPU implementation that
requires synchronous memory flushes for persistence, and so driven
that synchronous flush requirement into the DAX implementation
itself....
> Also, another benefit of pushing zeroing down into the driver is that
> for brd, as used in this example, it will rightly be a nop because
> there's no persistence to guarantee there.

Precisely my point - different drivers and hardware have different
semantics and optimisations, and only by separating the data copying
from the persistence model do we end up with infrastructure that is
flexible enough to work with different hardware/driver pmem models...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
Thread overview: 35+ messages
2015-10-19 3:27 [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Dave Chinner
2015-10-19 3:27 ` [PATCH 1/6] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
2015-10-29 14:27 ` Brian Foster
2015-10-19 3:27 ` [PATCH 2/6] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
2015-10-29 14:27 ` Brian Foster
2015-10-29 23:35 ` Dave Chinner
2015-10-30 12:36 ` Brian Foster
2015-11-02 1:21 ` Dave Chinner
2015-10-19 3:27 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner
2015-10-29 14:29 ` Brian Foster
2015-10-29 23:37 ` Dave Chinner
2015-10-30 12:36 ` Brian Foster
2015-11-02 1:14 ` Dave Chinner
2015-11-02 14:15 ` Brian Foster
2015-11-02 21:44 ` Dave Chinner
2015-11-03 3:53 ` Dan Williams
2015-11-03 5:04 ` Dave Chinner [this message]
2015-11-04 0:50 ` Ross Zwisler
2015-11-04 1:02 ` Dan Williams
2015-11-04 4:46 ` Ross Zwisler
2015-11-04 9:06 ` Jan Kara
2015-11-04 15:35 ` Ross Zwisler
2015-11-04 17:21 ` Jan Kara
2015-11-03 9:16 ` Jan Kara
2015-10-19 3:27 ` [PATCH 4/6] xfs: DAX does not use IO completion callbacks Dave Chinner
2015-10-29 14:29 ` Brian Foster
2015-10-29 23:39 ` Dave Chinner
2015-10-30 12:37 ` Brian Foster
2015-10-19 3:27 ` [PATCH 5/6] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
2015-10-29 14:30 ` Brian Foster
2015-10-19 3:27 ` [PATCH 6/6] xfs: xfs_filemap_pmd_fault treats read faults as write faults Dave Chinner
2015-10-29 14:30 ` Brian Foster
2015-11-05 23:48 ` [PATCH 0/6 V2] xfs: upfront block zeroing for DAX Ross Zwisler
2015-11-06 22:32 ` Dave Chinner
2015-11-06 18:12 ` Boylston, Brian