* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX [not found] ` <20151102141509.GA29346@bfoster.bfoster> @ 2015-11-02 21:44 ` Dave Chinner 2015-11-03 3:53 ` Dan Williams 2015-11-03 9:16 ` Jan Kara 0 siblings, 2 replies; 10+ messages in thread From: Dave Chinner @ 2015-11-02 21:44 UTC (permalink / raw) To: Brian Foster Cc: ross.zwisler, jack, xfs, linux-fsdevel, dan.j.williams, linux-nvdimm [add people to the cc list] On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote: > On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote: > > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote: > > > Unless there is some > > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically > > > causes a clear_pmem() over every page sized chunk of the target I/O > > > range for which we already have the data. > > > > I don't follow - this only zeros blocks when we do allocation of new > > blocks or overwrite unwritten extents, not on blocks which we > > already have written data extents allocated for... > > > > Why are we assuming that block zeroing is more efficient than unwritten > extents for DAX/dio? I haven't played with pmem enough to know for sure > one way or another (or if hw support is imminent), but I'd expect the > latter to be more efficient in general without any kind of hardware > support. > > Just as an example, here's an 8GB pwrite test, large buffer size, to XFS > on a ramdisk mounted with '-o dax:' > > - Before this series: > > # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file > wrote 8589934592/8589934592 bytes at offset 0 > 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec) > > - After this series: > > # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file > wrote 8589934592/8589934592 bytes at offset 0 > 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec) That looks wrong. Much, much slower than it should be just zeroing pages and then writing to them again while cache hot. Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this loop spending most of the CPU time: ¿ ¿ jbe ea ¿ de: clflus %ds:(%rax) 84.67 ¿ add %rsi,%rax ¿ cmp %rax,%rdx ¿ ¿ ja de ¿ ea: add %r13,-0x38(%rbp) ¿ sub %r12,%r14 ¿ sub %r12,-0x40(%rbp) That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU cache flushes after each memset. None of these pmem memory operations are optimised yet - the implementations are correct, but performance still needs work. The conversion to non-temporal stores should get rid of this cache flush overhead (somewhat), but I was still expecting this code to be much closer to memset speed and not reduce performance to bugger all... > The impact is less with a smaller buffer size so the above is just meant > to illustrate the point. FWIW, I'm also fine with getting this in as a > matter of "correctness before performance" since this stuff is clearly > still under development, but as far as I can see so far we should > probably ultimately prefer unwritten extents for DAX/DIO (or at least > plan to run some similar tests on real pmem hw). Thoughts? We're going to have these problems initially, but from the XFS persepective those problems don't matter because we have a different problem to solve. That is, we need to move towards an architecture that is highly efficient for low latency, high IOPS storage subsystems. The first step towards that is to be able to offload block zeroing to the hardware so we can avoid using unwritten extents. 
In the long run, we don't want to use unwritten extents on DAX if we can avoid it - the CPU overhead of unwritten extent conversion isn't constant (i.e. it grows with the size of the BMBT) and it isn't deterministic (e.g. split/merge take much more CPU than a simple record write to clear an unwritten flag). We don't notice this overhead much with normal IO because of the fact that the CPU time for conversion is much less than the CPU time for the IO to complete, hence it's a win. But for PMEM, directly zeroing a 4k chunk of data should take *much less CPU* than synchronously starting a transaction, reserving space, looking up the extent in the BMBT, loading the buffers from cache, modifying the buffers, logging the changes and committing the transaction (which in itself will typically copy more than a single page of metadata into the CIL buffers). Realistically, dax_clear_blocks() should probably be implemented at the pmem driver layer through blkdev_issue_zeroout() because all it does is directly map the sector/len to pfn via bdev_direct_access() and then zero it - it's a sector based, block device operation. We don't actually need a special case path for DAX here. Optimisation of this operation has little to do with the filesystem. This comes back to the comments I made w.r.t. the pmem driver implementation doing synchronous IO by immediately forcing CPU cache flushes and barriers. it's obviously correct, but it looks like there's going to be a major performance penalty associated with it. This is why I recently suggested that a pmem driver that doesn't do CPU cache writeback during IO but does it on REQ_FLUSH is an architecture we'll likely have to support. In this case, the block device won't need to flush CPU cache lines for the zeroed blocks until the allocation transaction is actually committed to the journal. In that case, there's a good chance that we'd end up also committing the new data as well, hence avoiding two synchronous memory writes. i.e. the "big hammer" REQ_FLUSH implementation may well be *much* faster than fine-grained synchronous "stable on completion" writes for persistent memory. This, however, is not really a problem for the filesystem - it's a pmem driver architecture problem. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 10+ messages in thread
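
To make the above concrete, here is a minimal sketch of the path the profile is pointing at - simplified from the 4.3-era fs/dax.c and the x86 pmem helpers, not the literal kernel code. dax_clear_blocks() maps each range with bdev_direct_access(), memsets it through clear_pmem(), and clear_pmem() finishes with a per-cacheline writeback loop, which is the clflush hot spot in the profile above:

/* Sketch only - simplified from fs/dax.c and arch/x86 pmem code, not verbatim. */
static void wb_cache_pmem_sketch(void *vaddr, size_t size)
{
	unsigned long clsize = boot_cpu_data.x86_clflush_size;
	void *vend = vaddr + size;
	void *p;

	/* the profiled loop: one clflush (or clwb where supported) per line */
	for (p = (void *)((unsigned long)vaddr & ~(clsize - 1)); p < vend;
	     p += clsize)
		clflush(p);
}

static int dax_clear_blocks_sketch(struct block_device *bdev, sector_t sector,
				   long size)
{
	do {
		void __pmem *addr;
		unsigned long pfn;
		long count;

		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
		if (count < 0)
			return count;
		count = min(count, size);
		clear_pmem(addr, count);	/* memset + wb_cache_pmem_sketch() */
		sector += count >> 9;
		size -= count;
	} while (size > 0);

	wmb_pmem();				/* order/complete the flushes */
	return 0;
}

Pushed down behind blkdev_issue_zeroout(), the same bdev_direct_access()-and-zero step becomes a plain block device operation, and the writeback policy (immediate clflush, clwb, non-temporal stores, or deferral to REQ_FLUSH) becomes the driver's choice.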
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-02 21:44 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner @ 2015-11-03 3:53 ` Dan Williams 2015-11-03 5:04 ` Dave Chinner 2015-11-03 9:16 ` Jan Kara 1 sibling, 1 reply; 10+ messages in thread From: Dan Williams @ 2015-11-03 3:53 UTC (permalink / raw) To: Dave Chinner Cc: Brian Foster, Ross Zwisler, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: > [add people to the cc list] > > On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote: >> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote: >> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote: >> > > Unless there is some >> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically >> > > causes a clear_pmem() over every page sized chunk of the target I/O >> > > range for which we already have the data. >> > >> > I don't follow - this only zeros blocks when we do allocation of new >> > blocks or overwrite unwritten extents, not on blocks which we >> > already have written data extents allocated for... >> > >> >> Why are we assuming that block zeroing is more efficient than unwritten >> extents for DAX/dio? I haven't played with pmem enough to know for sure >> one way or another (or if hw support is imminent), but I'd expect the >> latter to be more efficient in general without any kind of hardware >> support. >> >> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS >> on a ramdisk mounted with '-o dax:' >> >> - Before this series: >> >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file >> wrote 8589934592/8589934592 bytes at offset 0 >> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec) >> >> - After this series: >> >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file >> wrote 8589934592/8589934592 bytes at offset 0 >> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec) > > That looks wrong. Much, much slower than it should be just zeroing > pages and then writing to them again while cache hot. > > Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this > loop spending most of the CPU time: > > ¿ ¿ jbe ea > ¿ de: clflus %ds:(%rax) > 84.67 ¿ add %rsi,%rax > ¿ cmp %rax,%rdx > ¿ ¿ ja de > ¿ ea: add %r13,-0x38(%rbp) > ¿ sub %r12,%r14 > ¿ sub %r12,-0x40(%rbp) > > That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU > cache flushes after each memset. Ideally this would be non-temporal and skip the second flush loop altogether. Outside of that another problem is that this cpu does not support the clwb instruction and is instead using the serializing and invalidating clflush instruction. > None of these pmem memory operations are optimised yet - the > implementations are correct, but performance still needs work. The > conversion to non-temporal stores should get rid of this cache flush > overhead (somewhat), but I was still expecting this code to be much > closer to memset speed and not reduce performance to bugger all... > >> The impact is less with a smaller buffer size so the above is just meant >> to illustrate the point. FWIW, I'm also fine with getting this in as a >> matter of "correctness before performance" since this stuff is clearly >> still under development, but as far as I can see so far we should >> probably ultimately prefer unwritten extents for DAX/DIO (or at least >> plan to run some similar tests on real pmem hw). 
Thoughts? > > We're going to have these problems initially, but from the XFS > persepective those problems don't matter because we have a different > problem to solve. That is, we need to move towards an architecture > that is highly efficient for low latency, high IOPS storage > subsystems. The first step towards that is to be able to offload > block zeroing to the hardware so we can avoid using unwritten > extents. > > In the long run, we don't want to use unwritten extents on DAX if we > can avoid it - the CPU overhead of unwritten extent conversion isn't > constant (i.e. it grows with the size of the BMBT) and it isn't > deterministic (e.g. split/merge take much more CPU than a simple > record write to clear an unwritten flag). We don't notice this > overhead much with normal IO because of the fact that the CPU time > for conversion is much less than the CPU time for the IO to > complete, hence it's a win. > > But for PMEM, directly zeroing a 4k chunk of data should take *much > less CPU* than synchronously starting a transaction, reserving > space, looking up the extent in the BMBT, loading the buffers from > cache, modifying the buffers, logging the changes and committing the > transaction (which in itself will typically copy more than a single > page of metadata into the CIL buffers). > > Realistically, dax_clear_blocks() should probably be implemented at > the pmem driver layer through blkdev_issue_zeroout() because all it > does is directly map the sector/len to pfn via bdev_direct_access() > and then zero it - it's a sector based, block device operation. We > don't actually need a special case path for DAX here. Optimisation > of this operation has little to do with the filesystem. > > This comes back to the comments I made w.r.t. the pmem driver > implementation doing synchronous IO by immediately forcing CPU cache > flushes and barriers. it's obviously correct, but it looks like > there's going to be a major performance penalty associated with it. > This is why I recently suggested that a pmem driver that doesn't do > CPU cache writeback during IO but does it on REQ_FLUSH is an > architecture we'll likely have to support. > The only thing we can realistically delay is wmb_pmem() i.e. the final sync waiting for data that has *left* the cpu cache. Unless/until we get a architecturally guaranteed method to write-back the entire cache, or flush the cache by physical-cache-way we're stuck with either non-temporal cycles or looping on potentially huge virtual address ranges. > In this case, the block device won't need to flush CPU cache lines > for the zeroed blocks until the allocation transaction is actually > committed to the journal. In that case, there's a good chance that > we'd end up also committing the new data as well, hence avoiding two > synchronous memory writes. i.e. the "big hammer" REQ_FLUSH > implementation may well be *much* faster than fine-grained > synchronous "stable on completion" writes for persistent memory. I can only see it being faster in the case where the flush is cheap to initiate. That's not the case yet so we're stuck doing it synchronously. > > This, however, is not really a problem for the filesystem - it's > a pmem driver architecture problem. ;) > It's a platform problem. Let's see how this looks when not using clflush instructions. Also, another benefit of pushing zeroing down into the driver is that for brd, as used in this example, it will rightly be a nop because there's no persistence to guarantee there. 
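
For illustration, a rough sketch of the non-temporal variant referred to above. The helper is hypothetical (there is no non-temporal memset in the pmem API at this point); it leans on memcpy_to_pmem(), which already uses non-temporal stores, just to show the shape of it:

/*
 * Hypothetical sketch: stream zeroes with non-temporal stores so nothing
 * is left dirty in the CPU cache and the per-line clflush loop goes away;
 * only the final wmb_pmem() ordering point remains.  A real version would
 * want a dedicated non-temporal memset rather than copying the zero page.
 */
static void clear_pmem_nt_sketch(void __pmem *addr, size_t size)
{
	while (size) {
		size_t chunk = min_t(size_t, size, PAGE_SIZE);

		memcpy_to_pmem(addr, empty_zero_page, chunk);	/* movnt stores */
		addr += chunk;
		size -= chunk;
	}
}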
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-03 3:53 ` Dan Williams @ 2015-11-03 5:04 ` Dave Chinner 2015-11-04 0:50 ` Ross Zwisler 0 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2015-11-03 5:04 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, linux-nvdimm@lists.01.org, Brian Foster, xfs, linux-fsdevel, Ross Zwisler On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote: > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: > > [add people to the cc list] > > > > On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote: > >> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote: > >> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote: > >> > > Unless there is some > >> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically > >> > > causes a clear_pmem() over every page sized chunk of the target I/O > >> > > range for which we already have the data. > >> > > >> > I don't follow - this only zeros blocks when we do allocation of new > >> > blocks or overwrite unwritten extents, not on blocks which we > >> > already have written data extents allocated for... > >> > > >> > >> Why are we assuming that block zeroing is more efficient than unwritten > >> extents for DAX/dio? I haven't played with pmem enough to know for sure > >> one way or another (or if hw support is imminent), but I'd expect the > >> latter to be more efficient in general without any kind of hardware > >> support. > >> > >> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS > >> on a ramdisk mounted with '-o dax:' > >> > >> - Before this series: > >> > >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file > >> wrote 8589934592/8589934592 bytes at offset 0 > >> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec) > >> > >> - After this series: > >> > >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file > >> wrote 8589934592/8589934592 bytes at offset 0 > >> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec) > > > > That looks wrong. Much, much slower than it should be just zeroing > > pages and then writing to them again while cache hot. > > > > Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this > > loop spending most of the CPU time: > > > > ¿ ¿ jbe ea > > ¿ de: clflus %ds:(%rax) > > 84.67 ¿ add %rsi,%rax > > ¿ cmp %rax,%rdx > > ¿ ¿ ja de > > ¿ ea: add %r13,-0x38(%rbp) > > ¿ sub %r12,%r14 > > ¿ sub %r12,-0x40(%rbp) > > > > That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU > > cache flushes after each memset. > > Ideally this would be non-temporal and skip the second flush loop > altogether. Outside of that another problem is that this cpu does not > support the clwb instruction and is instead using the serializing and > invalidating clflush instruction. Sure, it can be optimised to improve behaviour, and other hardware will be more performant, but we'll still be doing synchronous cache flushing operations here and so the fundamental problem still exists. > > This comes back to the comments I made w.r.t. the pmem driver > > implementation doing synchronous IO by immediately forcing CPU cache > > flushes and barriers. it's obviously correct, but it looks like > > there's going to be a major performance penalty associated with it. > > This is why I recently suggested that a pmem driver that doesn't do > > CPU cache writeback during IO but does it on REQ_FLUSH is an > > architecture we'll likely have to support. 
> > > > The only thing we can realistically delay is wmb_pmem() i.e. the final > sync waiting for data that has *left* the cpu cache. Unless/until we > get a architecturally guaranteed method to write-back the entire > cache, or flush the cache by physical-cache-way we're stuck with > either non-temporal cycles or looping on potentially huge virtual > address ranges. I'm missing something: why won't flushing the address range returned by bdev_direct_access() during a fsync operation work? i.e. we're working with exactly the same address as dax_clear_blocks() and dax_do_io() use, so why can't we look up that address and flush it from fsync? > > This, however, is not really a problem for the filesystem - it's > > a pmem driver architecture problem. ;) > > > > It's a platform problem. Let's see how this looks when not using > clflush instructions. Yes, clflush vs clwb is that's a platform problem. However, the architectural problem I've refered to is that is we've designed the DAX infrastructure around a CPU implementation that requires synchronous memory flushes for persistence, and so driven that synchronous flush requirement into the DAX implementation itself.... > Also, another benefit of pushing zeroing down into the driver is that > for brd, as used in this example, it will rightly be a nop because > there's no persistence to guarantee there. Precisely my point - different drivers and hardware have different semantics and optimisations, and only by separating the data copying from the persistence model do we end up with infrastructure that is flexible enough to work with different hardware/driver pmem models... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 10+ messages in thread
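
A sketch of the fsync-time flush Dave is describing - hypothetical at this point; wb_cache_pmem() below stands for a cacheline-writeback primitive (a clwb/clflush range loop) that the pmem API would need to expose:

/* Hypothetical: write back a file range at fsync time rather than at IO time. */
static int dax_flush_sectors_sketch(struct block_device *bdev, sector_t sector,
				    long size)
{
	while (size > 0) {
		void __pmem *addr;
		unsigned long pfn;
		long len;

		len = bdev_direct_access(bdev, sector, &addr, &pfn, size);
		if (len < 0)
			return len;
		len = min(len, size);
		wb_cache_pmem(addr, len);	/* assumed clwb/clflush helper */
		sector += len >> 9;
		size -= len;
	}
	wmb_pmem();
	return 0;
}

->fsync would walk the dirty ranges of the file, map them back to pmem addresses exactly as dax_do_io() does, and write back only those lines.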
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-03 5:04 ` Dave Chinner @ 2015-11-04 0:50 ` Ross Zwisler 2015-11-04 1:02 ` Dan Williams 0 siblings, 1 reply; 10+ messages in thread From: Ross Zwisler @ 2015-11-04 0:50 UTC (permalink / raw) To: Dave Chinner Cc: Dan Williams, Brian Foster, Ross Zwisler, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote: > On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote: > > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: <> > > > This comes back to the comments I made w.r.t. the pmem driver > > > implementation doing synchronous IO by immediately forcing CPU cache > > > flushes and barriers. it's obviously correct, but it looks like > > > there's going to be a major performance penalty associated with it. > > > This is why I recently suggested that a pmem driver that doesn't do > > > CPU cache writeback during IO but does it on REQ_FLUSH is an > > > architecture we'll likely have to support. > > > > > > > The only thing we can realistically delay is wmb_pmem() i.e. the final > > sync waiting for data that has *left* the cpu cache. Unless/until we > > get a architecturally guaranteed method to write-back the entire > > cache, or flush the cache by physical-cache-way we're stuck with > > either non-temporal cycles or looping on potentially huge virtual > > address ranges. > > I'm missing something: why won't flushing the address range returned > by bdev_direct_access() during a fsync operation work? i.e. we're > working with exactly the same address as dax_clear_blocks() and > dax_do_io() use, so why can't we look up that address and flush it > from fsync? I could be wrong, but I don't see a reason why DAX can't use the strategy of writing data and marking it dirty in one step and then flushing later in response to fsync/msync. I think this could be used everywhere we write or zero data - dax_clear_blocks(), dax_io() etc. (I believe that lots of the block zeroing code will go away once we have the XFS and ext4 patches in that guarantee we will only get written and zeroed extents from the filesystem in response to get_block().) I think the PMEM driver, lacking the ability to mark things as dirty in the radix tree, etc, will need to keep doing things synchronously. Hmm...if we go this path, though, is that an argument against moving the zeroing from DAX down into the driver? True, with BRD it makes things nice and efficient because you can zero and never flush, and the driver knows there's nothing else to do. For PMEM, though, you lose the ability to zero the data and then queue the flushing for later, as you would be able to do if you left the zeroing code in DAX. The benefit of this is that if you are going to immediately re-write the newly zeroed data (which seems common), PMEM will end up doing an extra cache flush of the zeroes, only to have them overwritten and marked as dirty by DAX. If we leave the zeroing to DAX we can mark it dirty once, zero it once, write it once, and flush it once. This would make us lose the ability to do hardware-assisted flushing in the future that requires driver specific knowledge, though I don't think that exists yet. Perhaps we should leave the zeroing in DAX for now to take advantage of the single flush, and then move it down if a driver can improve performance with hardware assisted PMEM zeroing? ^ permalink raw reply [flat|nested] 10+ messages in thread
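
A minimal sketch of the "mark dirty now, flush at fsync/msync" bookkeeping Ross describes - hypothetical here, and it glosses over the hard part, namely that DAX has no struct page to hang the radix tree entry on:

/* On a DAX write/zero/fault: remember which file offset was dirtied.
 * Assumes an entry for this index has already been inserted (DAX has no
 * struct page here, which is the open problem). */
static void dax_mark_dirty_sketch(struct address_space *mapping, pgoff_t index)
{
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}

/*
 * Later, from ->fsync/->msync: walk the PAGECACHE_TAG_DIRTY entries, look
 * each block up again via bdev_direct_access(), write back those
 * cachelines, clear the tag, and finish with a single wmb_pmem().
 */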
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 0:50 ` Ross Zwisler @ 2015-11-04 1:02 ` Dan Williams 2015-11-04 4:46 ` Ross Zwisler 0 siblings, 1 reply; 10+ messages in thread From: Dan Williams @ 2015-11-04 1:02 UTC (permalink / raw) To: Ross Zwisler, Dave Chinner, Dan Williams, Brian Foster, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Tue, Nov 3, 2015 at 4:50 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote: >> On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote: >> > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: > <> >> > > This comes back to the comments I made w.r.t. the pmem driver >> > > implementation doing synchronous IO by immediately forcing CPU cache >> > > flushes and barriers. it's obviously correct, but it looks like >> > > there's going to be a major performance penalty associated with it. >> > > This is why I recently suggested that a pmem driver that doesn't do >> > > CPU cache writeback during IO but does it on REQ_FLUSH is an >> > > architecture we'll likely have to support. >> > > >> > >> > The only thing we can realistically delay is wmb_pmem() i.e. the final >> > sync waiting for data that has *left* the cpu cache. Unless/until we >> > get a architecturally guaranteed method to write-back the entire >> > cache, or flush the cache by physical-cache-way we're stuck with >> > either non-temporal cycles or looping on potentially huge virtual >> > address ranges. >> >> I'm missing something: why won't flushing the address range returned >> by bdev_direct_access() during a fsync operation work? i.e. we're >> working with exactly the same address as dax_clear_blocks() and >> dax_do_io() use, so why can't we look up that address and flush it >> from fsync? > > I could be wrong, but I don't see a reason why DAX can't use the strategy of > writing data and marking it dirty in one step and then flushing later in > response to fsync/msync. I think this could be used everywhere we write or > zero data - dax_clear_blocks(), dax_io() etc. (I believe that lots of the > block zeroing code will go away once we have the XFS and ext4 patches in that > guarantee we will only get written and zeroed extents from the filesystem in > response to get_block().) I think the PMEM driver, lacking the ability to > mark things as dirty in the radix tree, etc, will need to keep doing things > synchronously. Not without numbers showing the relative performance of dirtying cache followed by flushing vs non-temporal + pcommit. > Hmm...if we go this path, though, is that an argument against moving the > zeroing from DAX down into the driver? True, with BRD it makes things nice > and efficient because you can zero and never flush, and the driver knows > there's nothing else to do. > > For PMEM, though, you lose the ability to zero the data and then queue the > flushing for later, as you would be able to do if you left the zeroing code in > DAX. The benefit of this is that if you are going to immediately re-write the > newly zeroed data (which seems common), PMEM will end up doing an extra cache > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > it once, and flush it once. Why do we lose the ability to flush later if the driver supports blkdev_issue_zeroout? 
> This would make us lose the ability to do hardware-assisted flushing in the > future that requires driver specific knowledge, though I don't think that > exists yet. ioatdma has supported memset() for a while now, but I would prioritize a non-temporal SIMD implementation first. > Perhaps we should leave the zeroing in DAX for now to take > advantage of the single flush, and then move it down if a driver can improve > performance with hardware assisted PMEM zeroing? Not convinced. I think we should implement the driver zeroing solution and take a look at performance. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 1:02 ` Dan Williams @ 2015-11-04 4:46 ` Ross Zwisler 2015-11-04 9:06 ` Jan Kara 0 siblings, 1 reply; 10+ messages in thread From: Ross Zwisler @ 2015-11-04 4:46 UTC (permalink / raw) To: Dan Williams Cc: Ross Zwisler, Dave Chinner, Brian Foster, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote: > On Tue, Nov 3, 2015 at 4:50 PM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote: > >> On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote: > >> > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: > > <> > >> > > This comes back to the comments I made w.r.t. the pmem driver > >> > > implementation doing synchronous IO by immediately forcing CPU cache > >> > > flushes and barriers. it's obviously correct, but it looks like > >> > > there's going to be a major performance penalty associated with it. > >> > > This is why I recently suggested that a pmem driver that doesn't do > >> > > CPU cache writeback during IO but does it on REQ_FLUSH is an > >> > > architecture we'll likely have to support. > >> > > > >> > > >> > The only thing we can realistically delay is wmb_pmem() i.e. the final > >> > sync waiting for data that has *left* the cpu cache. Unless/until we > >> > get a architecturally guaranteed method to write-back the entire > >> > cache, or flush the cache by physical-cache-way we're stuck with > >> > either non-temporal cycles or looping on potentially huge virtual > >> > address ranges. > >> > >> I'm missing something: why won't flushing the address range returned > >> by bdev_direct_access() during a fsync operation work? i.e. we're > >> working with exactly the same address as dax_clear_blocks() and > >> dax_do_io() use, so why can't we look up that address and flush it > >> from fsync? > > > > I could be wrong, but I don't see a reason why DAX can't use the strategy of > > writing data and marking it dirty in one step and then flushing later in > > response to fsync/msync. I think this could be used everywhere we write or > > zero data - dax_clear_blocks(), dax_io() etc. (I believe that lots of the > > block zeroing code will go away once we have the XFS and ext4 patches in that > > guarantee we will only get written and zeroed extents from the filesystem in > > response to get_block().) I think the PMEM driver, lacking the ability to > > mark things as dirty in the radix tree, etc, will need to keep doing things > > synchronously. > > Not without numbers showing the relative performance of dirtying cache > followed by flushing vs non-temporal + pcommit. Sorry - do you mean that you want to make sure that we get a performance benefit from the "dirty and flush later" path vs the "write and flush now" path? Sure, that seems reasonable. > > Hmm...if we go this path, though, is that an argument against moving the > > zeroing from DAX down into the driver? True, with BRD it makes things nice > > and efficient because you can zero and never flush, and the driver knows > > there's nothing else to do. > > > > For PMEM, though, you lose the ability to zero the data and then queue the > > flushing for later, as you would be able to do if you left the zeroing code in > > DAX. 
The benefit of this is that if you are going to immediately re-write the > > newly zeroed data (which seems common), PMEM will end up doing an extra cache > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > > it once, and flush it once. > > Why do we lose the ability to flush later if the driver supports > blkdev_issue_zeroout? I think that if you implement zeroing in the driver you'd need to also flush in the driver because you wouldn't have access to the radix tree to be able to mark entries as dirty so you can flush them later. As I think about this more, though, I'm not sure that having the zeroing flush later could work. I'm guessing that the filesystem must require a sync point between the zeroing and the subsequent follow-up writes so that you can sync metadata for the block allocation. Otherwise you could end up in a situation where you've got your metadata pointing at newly allocated blocks but the new zeros are still in the processor cache - if you lose power you've just created an information leak. Dave, Jan, does this make sense? > > This would make us lose the ability to do hardware-assisted flushing in the > > future that requires driver specific knowledge, though I don't think that > > exists yet. > > ioatdma has supported memset() for a while now, but I would prioritize > a non-temporal SIMD implementation first. Sweet, didn't know about that, obviously. :) Thanks for the pointer. > > Perhaps we should leave the zeroing in DAX for now to take > > advantage of the single flush, and then move it down if a driver can improve > > performance with hardware assisted PMEM zeroing? > > Not convinced. I think we should implement the driver zeroing > solution and take a look at performance. I agree, this should all be driven by performance measurements. Thanks for the feedback. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 4:46 ` Ross Zwisler @ 2015-11-04 9:06 ` Jan Kara 2015-11-04 15:35 ` Ross Zwisler 0 siblings, 1 reply; 10+ messages in thread From: Jan Kara @ 2015-11-04 9:06 UTC (permalink / raw) To: Ross Zwisler Cc: Dan Williams, Dave Chinner, Brian Foster, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Tue 03-11-15 21:46:13, Ross Zwisler wrote: > On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote: > > > Hmm...if we go this path, though, is that an argument against moving the > > > zeroing from DAX down into the driver? True, with BRD it makes things nice > > > and efficient because you can zero and never flush, and the driver knows > > > there's nothing else to do. > > > > > > For PMEM, though, you lose the ability to zero the data and then queue the > > > flushing for later, as you would be able to do if you left the zeroing code in > > > DAX. The benefit of this is that if you are going to immediately re-write the > > > newly zeroed data (which seems common), PMEM will end up doing an extra cache > > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > > > it once, and flush it once. > > > > Why do we lose the ability to flush later if the driver supports > > blkdev_issue_zeroout? > > I think that if you implement zeroing in the driver you'd need to also > flush in the driver because you wouldn't have access to the radix tree to > be able to mark entries as dirty so you can flush them later. > > As I think about this more, though, I'm not sure that having the zeroing > flush later could work. I'm guessing that the filesystem must require a > sync point between the zeroing and the subsequent follow-up writes so > that you can sync metadata for the block allocation. Otherwise you could > end up in a situation where you've got your metadata pointing at newly > allocated blocks but the new zeros are still in the processor cache - if > you lose power you've just created an information leak. Dave, Jan, does > this make sense? So the problem you describe does not exist. Thing to keep in mind is that filesystem are designed to work reliably with 'non-persistent' cache in the disk which is common these days. That's why we bother with all that REQ_FLUSH | REQ_FUA and blkdev_issue_flush() stuff after all. Processor cache is exactly that kind of the cache attached to the PMEM storage. And Dave and I try to steer you to a solution that would also treat it equally in DAX filesystems as well :). Now how the problem is currently solved: When we allocate blocks, we just record that information in a transaction in the journal. For DAX case we also submit the IO zeroing those blocks and wait for it. Now if we crash before the transaction gets committed, blocks won't be seen in the inode after a journal recovery and thus no data exposure can happen. As a part of transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH request). We expect that to force out all the IO in volatile caches into the persistent storage. So this will also force the zeroing into persistent storage for normal disks and AFAIU if you do zeroing with non-temporal writes in pmem driver and then do wmb_pmem() in response to a flush request we get the same persistency guarantee in pmem case as well. So after a transaction commit we are guaranteed to see zeros in those allocated blocks. 
So the transaction commit, and the corresponding flush request in particular, is the sync point you speak about above, but the good thing is that in most cases this will happen after the real data gets written into those blocks, so we save the unnecessary flush. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
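
Condensing the ordering Jan describes into one sequence (a sketch, not literal jbd2/XFS code), with the driver's half of the contract underneath:

/*
 * 1. start transaction; allocate blocks; log the extent/bmap update
 * 2. zero the newly allocated blocks (DAX: clear_pmem()/non-temporal stores)
 * 3. commit the transaction:
 *      - write and flush the journal
 *      - issue REQ_FLUSH / blkdev_issue_flush()
 *        -> pmem makes everything written so far durable (wmb_pmem() etc.)
 *
 * Crash before step 3: recovery never replays the allocation, so the
 * not-yet-durable zeroes are never reachable - no stale data exposure.
 * Crash after step 3: the zeroes became durable no later than the
 * allocation did, so any block a reader can reach is already zeroed.
 */

/* Driver side of that contract, roughly (pmem bio path, simplified): */
static void pmem_handle_flush_sketch(struct bio *bio)
{
	if (bio->bi_rw & REQ_FLUSH)
		wmb_pmem();	/* plus any writeback a deferred model batched up */
}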
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 9:06 ` Jan Kara @ 2015-11-04 15:35 ` Ross Zwisler 2015-11-04 17:21 ` Jan Kara 0 siblings, 1 reply; 10+ messages in thread From: Ross Zwisler @ 2015-11-04 15:35 UTC (permalink / raw) To: Jan Kara Cc: Ross Zwisler, Dan Williams, Dave Chinner, Brian Foster, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Wed, Nov 04, 2015 at 10:06:12AM +0100, Jan Kara wrote: > On Tue 03-11-15 21:46:13, Ross Zwisler wrote: > > On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote: > > > > Hmm...if we go this path, though, is that an argument against moving the > > > > zeroing from DAX down into the driver? True, with BRD it makes things nice > > > > and efficient because you can zero and never flush, and the driver knows > > > > there's nothing else to do. > > > > > > > > For PMEM, though, you lose the ability to zero the data and then queue the > > > > flushing for later, as you would be able to do if you left the zeroing code in > > > > DAX. The benefit of this is that if you are going to immediately re-write the > > > > newly zeroed data (which seems common), PMEM will end up doing an extra cache > > > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > > > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > > > > it once, and flush it once. > > > > > > Why do we lose the ability to flush later if the driver supports > > > blkdev_issue_zeroout? > > > > I think that if you implement zeroing in the driver you'd need to also > > flush in the driver because you wouldn't have access to the radix tree to > > be able to mark entries as dirty so you can flush them later. > > > > As I think about this more, though, I'm not sure that having the zeroing > > flush later could work. I'm guessing that the filesystem must require a > > sync point between the zeroing and the subsequent follow-up writes so > > that you can sync metadata for the block allocation. Otherwise you could > > end up in a situation where you've got your metadata pointing at newly > > allocated blocks but the new zeros are still in the processor cache - if > > you lose power you've just created an information leak. Dave, Jan, does > > this make sense? > > So the problem you describe does not exist. Thing to keep in mind is that > filesystem are designed to work reliably with 'non-persistent' cache in the > disk which is common these days. That's why we bother with all that > REQ_FLUSH | REQ_FUA and blkdev_issue_flush() stuff after all. Processor > cache is exactly that kind of the cache attached to the PMEM storage. And > Dave and I try to steer you to a solution that would also treat it equally > in DAX filesystems as well :). And I'm always grateful for the guidance. :) > Now how the problem is currently solved: When we allocate blocks, we just > record that information in a transaction in the journal. For DAX case we > also submit the IO zeroing those blocks and wait for it. Now if we crash > before the transaction gets committed, blocks won't be seen in the inode > after a journal recovery and thus no data exposure can happen. As a part of > transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH > request). We expect that to force out all the IO in volatile caches into > the persistent storage. 
So this will also force the zeroing into persistent > storage for normal disks and AFAIU if you do zeroing with non-temporal > writes in pmem driver and then do wmb_pmem() in response to a flush request > we get the same persistency guarantee in pmem case as well. So after a > transaction commit we are guaranteed to see zeros in those allocated > blocks. > > So the transaction commit and the corresponding flush request in particular > is the sync point you speak about above but the good thing is that in most > cases this will happen after real data gets written into those blocks so we > save the unnecessary flush. Cool, thank you for the explanation, that makes sense to me. When dealing with normal SSDs and the page cache, does the filesystem keep the zeroes in the page cache, or does it issue it directly to the driver via sb_issue_zeroout()/blkdev_issue_zeroout()? If we keep it in the page cache so that the follow-up writes just update the dirty pages and we end up writing to media once, this seems like it would flow nicely into the idea of zeroing new blocks at the DAX level without flushing and just marking them as dirty in the radix tree. If the zeroing happens via sb_issue_zeroout() then this probably doesn't make sense because the existing flow won't include a fsync/msync type step of the newly zeroed data in the page cache. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 15:35 ` Ross Zwisler @ 2015-11-04 17:21 ` Jan Kara 0 siblings, 0 replies; 10+ messages in thread From: Jan Kara @ 2015-11-04 17:21 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, Dan Williams, Dave Chinner, Brian Foster, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Wed 04-11-15 08:35:51, Ross Zwisler wrote: > On Wed, Nov 04, 2015 at 10:06:12AM +0100, Jan Kara wrote: > > Now how the problem is currently solved: When we allocate blocks, we just > > record that information in a transaction in the journal. For DAX case we > > also submit the IO zeroing those blocks and wait for it. Now if we crash > > before the transaction gets committed, blocks won't be seen in the inode > > after a journal recovery and thus no data exposure can happen. As a part of > > transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH > > request). We expect that to force out all the IO in volatile caches into > > the persistent storage. So this will also force the zeroing into persistent > > storage for normal disks and AFAIU if you do zeroing with non-temporal > > writes in pmem driver and then do wmb_pmem() in response to a flush request > > we get the same persistency guarantee in pmem case as well. So after a > > transaction commit we are guaranteed to see zeros in those allocated > > blocks. > > > > So the transaction commit and the corresponding flush request in particular > > is the sync point you speak about above but the good thing is that in most > > cases this will happen after real data gets written into those blocks so we > > save the unnecessary flush. > > Cool, thank you for the explanation, that makes sense to me. > > When dealing with normal SSDs and the page cache, does the filesystem keep the > zeroes in the page cache, or does it issue it directly to the driver via > sb_issue_zeroout()/blkdev_issue_zeroout()? If we keep it in the page cache so Currently we use sb_issue_zeroout() for zeroing out blocks in ext4 (for DAX and in some other cases to avoid splitting extents too much). Note that zeroing blocks in page cache won't work on it's own - blkdev_issue_flush() doesn't force out any dirty pages from page cache so transaction commit wouldn't make sure stale block contents is not exposed. However we actually do have a mechanism in ext4 to force out data from page cache during transaction commit - that is what data=ordered mode is all about - we can attach inode to a transaction and all dirty data of that inode that has underlying blocks allocated is flushed as a part of transaction commit. So this makes sure stale data cannot be seen after a crash in data=ordered mode. XFS doesn't have this mechanism and thus it normally has to put allocated blocks in unwritten extents, submit IO to write data to those blocks, and convert extent to written once IO finishes. > that the follow-up writes just update the dirty pages and we end up > writing to media once, this seems like it would flow nicely into the idea > of zeroing new blocks at the DAX level without flushing and just marking > them as dirty in the radix tree. So for ext4 you could make this work the same way data=ordered mode works - just store zeroes into allocated blocks, mark page as dirty in the radix tree, attach inode to the running transaction, and on transaction commit do the flush. 
However note that for DAX page faults zeroing needs to happen under fs lock serializing faults to the same page which effectively means it happens inside filesystem's block mapping function and at that place we don't have a page to zero out. So it would require considerable plumbing to make this work even for ext4 and I'm not even speaking of XFS where the flushing mechanism on transaction commit doesn't currently exist AFAIK. Frankly since the savings would be realized only for allocating writes into mmaped file which aren't than common, I'm not convinced this would be worth it. > If the zeroing happens via sb_issue_zeroout() then this probably doesn't > make sense because the existing flow won't include a fsync/msync type > step of the newly zeroed data in the page cache. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
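
For concreteness, the ext4 zeroing path Jan refers to boils down to a call of this shape (simplified - the real code sits in the extent-conversion paths around ext4_ext_zeroout()):

/* Simplified: ext4 zeroes allocated blocks through the block layer rather
 * than through a DAX-special path; fs-block to sector conversion happens
 * inside sb_issue_zeroout(). */
static int ext4_zeroout_sketch(struct inode *inode, ext4_fsblk_t pblk,
			       unsigned int len)
{
	return sb_issue_zeroout(inode->i_sb, pblk, len, GFP_NOFS);
}

This is why making blkdev_issue_zeroout() fast on pmem helps ext4 (and any other caller) without any DAX-specific plumbing.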
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-02 21:44 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner 2015-11-03 3:53 ` Dan Williams @ 2015-11-03 9:16 ` Jan Kara 1 sibling, 0 replies; 10+ messages in thread From: Jan Kara @ 2015-11-03 9:16 UTC (permalink / raw) To: Dave Chinner Cc: jack, linux-nvdimm, Brian Foster, xfs, linux-fsdevel, ross.zwisler, dan.j.williams On Tue 03-11-15 08:44:24, Dave Chinner wrote: > Realistically, dax_clear_blocks() should probably be implemented at > the pmem driver layer through blkdev_issue_zeroout() because all it > does is directly map the sector/len to pfn via bdev_direct_access() > and then zero it - it's a sector based, block device operation. We > don't actually need a special case path for DAX here. Optimisation > of this operation has little to do with the filesystem. Yep. This is actually what I did in ext4 - the block zeroing is using the ext4 block zeroout path which ends up calling blkdev_issue_zeroout(). I didn't want to special-case DAX and figured out that if we want performance, we should implement blkdev_issue_zeroout() efficiently for pmem. After all ext4 uses blkdev_issue_zeroout() in other extent conversion cases where zeroing out is needed. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 10+ messages in thread