* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX [not found] ` <20151102141509.GA29346@bfoster.bfoster> @ 2015-11-02 21:44 ` Dave Chinner 2015-11-03 3:53 ` Dan Williams 2015-11-03 9:16 ` Jan Kara 0 siblings, 2 replies; 10+ messages in thread From: Dave Chinner @ 2015-11-02 21:44 UTC (permalink / raw) To: Brian Foster Cc: ross.zwisler, jack, xfs, linux-fsdevel, dan.j.williams, linux-nvdimm [add people to the cc list] On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote: > On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote: > > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote: > > > Unless there is some > > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically > > > causes a clear_pmem() over every page sized chunk of the target I/O > > > range for which we already have the data. > > > > I don't follow - this only zeros blocks when we do allocation of new > > blocks or overwrite unwritten extents, not on blocks which we > > already have written data extents allocated for... > > > > Why are we assuming that block zeroing is more efficient than unwritten > extents for DAX/dio? I haven't played with pmem enough to know for sure > one way or another (or if hw support is imminent), but I'd expect the > latter to be more efficient in general without any kind of hardware > support. > > Just as an example, here's an 8GB pwrite test, large buffer size, to XFS > on a ramdisk mounted with '-o dax:' > > - Before this series: > > # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file > wrote 8589934592/8589934592 bytes at offset 0 > 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec) > > - After this series: > > # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file > wrote 8589934592/8589934592 bytes at offset 0 > 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec) That looks wrong. Much, much slower than it should be just zeroing pages and then writing to them again while cache hot. Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this loop spending most of the CPU time: ¿ ¿ jbe ea ¿ de: clflus %ds:(%rax) 84.67 ¿ add %rsi,%rax ¿ cmp %rax,%rdx ¿ ¿ ja de ¿ ea: add %r13,-0x38(%rbp) ¿ sub %r12,%r14 ¿ sub %r12,-0x40(%rbp) That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU cache flushes after each memset. None of these pmem memory operations are optimised yet - the implementations are correct, but performance still needs work. The conversion to non-temporal stores should get rid of this cache flush overhead (somewhat), but I was still expecting this code to be much closer to memset speed and not reduce performance to bugger all... > The impact is less with a smaller buffer size so the above is just meant > to illustrate the point. FWIW, I'm also fine with getting this in as a > matter of "correctness before performance" since this stuff is clearly > still under development, but as far as I can see so far we should > probably ultimately prefer unwritten extents for DAX/DIO (or at least > plan to run some similar tests on real pmem hw). Thoughts? We're going to have these problems initially, but from the XFS persepective those problems don't matter because we have a different problem to solve. That is, we need to move towards an architecture that is highly efficient for low latency, high IOPS storage subsystems. The first step towards that is to be able to offload block zeroing to the hardware so we can avoid using unwritten extents. 
In the long run, we don't want to use unwritten extents on DAX if we can avoid it - the CPU overhead of unwritten extent conversion isn't constant (i.e. it grows with the size of the BMBT) and it isn't deterministic (e.g. split/merge take much more CPU than a simple record write to clear an unwritten flag). We don't notice this overhead much with normal IO because of the fact that the CPU time for conversion is much less than the CPU time for the IO to complete, hence it's a win. But for PMEM, directly zeroing a 4k chunk of data should take *much less CPU* than synchronously starting a transaction, reserving space, looking up the extent in the BMBT, loading the buffers from cache, modifying the buffers, logging the changes and committing the transaction (which in itself will typically copy more than a single page of metadata into the CIL buffers). Realistically, dax_clear_blocks() should probably be implemented at the pmem driver layer through blkdev_issue_zeroout() because all it does is directly map the sector/len to pfn via bdev_direct_access() and then zero it - it's a sector based, block device operation. We don't actually need a special case path for DAX here. Optimisation of this operation has little to do with the filesystem. This comes back to the comments I made w.r.t. the pmem driver implementation doing synchronous IO by immediately forcing CPU cache flushes and barriers. it's obviously correct, but it looks like there's going to be a major performance penalty associated with it. This is why I recently suggested that a pmem driver that doesn't do CPU cache writeback during IO but does it on REQ_FLUSH is an architecture we'll likely have to support. In this case, the block device won't need to flush CPU cache lines for the zeroed blocks until the allocation transaction is actually committed to the journal. In that case, there's a good chance that we'd end up also committing the new data as well, hence avoiding two synchronous memory writes. i.e. the "big hammer" REQ_FLUSH implementation may well be *much* faster than fine-grained synchronous "stable on completion" writes for persistent memory. This, however, is not really a problem for the filesystem - it's a pmem driver architecture problem. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 10+ messages in thread
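
To make the above concrete, here is a minimal sketch of the path the profile is pointing at - simplified from the 4.3-era fs/dax.c and the x86 pmem helpers, not the literal kernel code. dax_clear_blocks() maps each range with bdev_direct_access(), memsets it through clear_pmem(), and clear_pmem() finishes with a per-cacheline writeback loop, which is the clflush hot spot in the profile above:

/* Sketch only - simplified from fs/dax.c and arch/x86 pmem code, not verbatim. */
static void wb_cache_pmem_sketch(void *vaddr, size_t size)
{
	unsigned long clsize = boot_cpu_data.x86_clflush_size;
	void *vend = vaddr + size;
	void *p;

	/* the profiled loop: one clflush (or clwb where supported) per line */
	for (p = (void *)((unsigned long)vaddr & ~(clsize - 1)); p < vend;
	     p += clsize)
		clflush(p);
}

static int dax_clear_blocks_sketch(struct block_device *bdev, sector_t sector,
				   long size)
{
	do {
		void __pmem *addr;
		unsigned long pfn;
		long count;

		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
		if (count < 0)
			return count;
		count = min(count, size);
		clear_pmem(addr, count);	/* memset + wb_cache_pmem_sketch() */
		sector += count >> 9;
		size -= count;
	} while (size > 0);

	wmb_pmem();				/* order/complete the flushes */
	return 0;
}

Pushed down behind blkdev_issue_zeroout(), the same bdev_direct_access()-and-zero step becomes a plain block device operation, and the writeback policy (immediate clflush, clwb, non-temporal stores, or deferral to REQ_FLUSH) becomes the driver's choice.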
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-02 21:44 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner @ 2015-11-03 3:53 ` Dan Williams 2015-11-03 5:04 ` Dave Chinner 2015-11-03 9:16 ` Jan Kara 1 sibling, 1 reply; 10+ messages in thread From: Dan Williams @ 2015-11-03 3:53 UTC (permalink / raw) To: Dave Chinner Cc: Brian Foster, Ross Zwisler, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: > [add people to the cc list] > > On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote: >> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote: >> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote: >> > > Unless there is some >> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically >> > > causes a clear_pmem() over every page sized chunk of the target I/O >> > > range for which we already have the data. >> > >> > I don't follow - this only zeros blocks when we do allocation of new >> > blocks or overwrite unwritten extents, not on blocks which we >> > already have written data extents allocated for... >> > >> >> Why are we assuming that block zeroing is more efficient than unwritten >> extents for DAX/dio? I haven't played with pmem enough to know for sure >> one way or another (or if hw support is imminent), but I'd expect the >> latter to be more efficient in general without any kind of hardware >> support. >> >> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS >> on a ramdisk mounted with '-o dax:' >> >> - Before this series: >> >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file >> wrote 8589934592/8589934592 bytes at offset 0 >> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec) >> >> - After this series: >> >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file >> wrote 8589934592/8589934592 bytes at offset 0 >> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec) > > That looks wrong. Much, much slower than it should be just zeroing > pages and then writing to them again while cache hot. > > Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this > loop spending most of the CPU time: > > ¿ ¿ jbe ea > ¿ de: clflus %ds:(%rax) > 84.67 ¿ add %rsi,%rax > ¿ cmp %rax,%rdx > ¿ ¿ ja de > ¿ ea: add %r13,-0x38(%rbp) > ¿ sub %r12,%r14 > ¿ sub %r12,-0x40(%rbp) > > That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU > cache flushes after each memset. Ideally this would be non-temporal and skip the second flush loop altogether. Outside of that another problem is that this cpu does not support the clwb instruction and is instead using the serializing and invalidating clflush instruction. > None of these pmem memory operations are optimised yet - the > implementations are correct, but performance still needs work. The > conversion to non-temporal stores should get rid of this cache flush > overhead (somewhat), but I was still expecting this code to be much > closer to memset speed and not reduce performance to bugger all... > >> The impact is less with a smaller buffer size so the above is just meant >> to illustrate the point. FWIW, I'm also fine with getting this in as a >> matter of "correctness before performance" since this stuff is clearly >> still under development, but as far as I can see so far we should >> probably ultimately prefer unwritten extents for DAX/DIO (or at least >> plan to run some similar tests on real pmem hw). 
Thoughts? > > We're going to have these problems initially, but from the XFS > persepective those problems don't matter because we have a different > problem to solve. That is, we need to move towards an architecture > that is highly efficient for low latency, high IOPS storage > subsystems. The first step towards that is to be able to offload > block zeroing to the hardware so we can avoid using unwritten > extents. > > In the long run, we don't want to use unwritten extents on DAX if we > can avoid it - the CPU overhead of unwritten extent conversion isn't > constant (i.e. it grows with the size of the BMBT) and it isn't > deterministic (e.g. split/merge take much more CPU than a simple > record write to clear an unwritten flag). We don't notice this > overhead much with normal IO because of the fact that the CPU time > for conversion is much less than the CPU time for the IO to > complete, hence it's a win. > > But for PMEM, directly zeroing a 4k chunk of data should take *much > less CPU* than synchronously starting a transaction, reserving > space, looking up the extent in the BMBT, loading the buffers from > cache, modifying the buffers, logging the changes and committing the > transaction (which in itself will typically copy more than a single > page of metadata into the CIL buffers). > > Realistically, dax_clear_blocks() should probably be implemented at > the pmem driver layer through blkdev_issue_zeroout() because all it > does is directly map the sector/len to pfn via bdev_direct_access() > and then zero it - it's a sector based, block device operation. We > don't actually need a special case path for DAX here. Optimisation > of this operation has little to do with the filesystem. > > This comes back to the comments I made w.r.t. the pmem driver > implementation doing synchronous IO by immediately forcing CPU cache > flushes and barriers. it's obviously correct, but it looks like > there's going to be a major performance penalty associated with it. > This is why I recently suggested that a pmem driver that doesn't do > CPU cache writeback during IO but does it on REQ_FLUSH is an > architecture we'll likely have to support. > The only thing we can realistically delay is wmb_pmem() i.e. the final sync waiting for data that has *left* the cpu cache. Unless/until we get a architecturally guaranteed method to write-back the entire cache, or flush the cache by physical-cache-way we're stuck with either non-temporal cycles or looping on potentially huge virtual address ranges. > In this case, the block device won't need to flush CPU cache lines > for the zeroed blocks until the allocation transaction is actually > committed to the journal. In that case, there's a good chance that > we'd end up also committing the new data as well, hence avoiding two > synchronous memory writes. i.e. the "big hammer" REQ_FLUSH > implementation may well be *much* faster than fine-grained > synchronous "stable on completion" writes for persistent memory. I can only see it being faster in the case where the flush is cheap to initiate. That's not the case yet so we're stuck doing it synchronously. > > This, however, is not really a problem for the filesystem - it's > a pmem driver architecture problem. ;) > It's a platform problem. Let's see how this looks when not using clflush instructions. Also, another benefit of pushing zeroing down into the driver is that for brd, as used in this example, it will rightly be a nop because there's no persistence to guarantee there. 
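
For illustration, a rough sketch of the non-temporal variant referred to above. The helper is hypothetical (there is no non-temporal memset in the pmem API at this point); it leans on memcpy_to_pmem(), which already uses non-temporal stores, just to show the shape of it:

/*
 * Hypothetical sketch: stream zeroes with non-temporal stores so nothing
 * is left dirty in the CPU cache and the per-line clflush loop goes away;
 * only the final wmb_pmem() ordering point remains.  A real version would
 * want a dedicated non-temporal memset rather than copying the zero page.
 */
static void clear_pmem_nt_sketch(void __pmem *addr, size_t size)
{
	while (size) {
		size_t chunk = min_t(size_t, size, PAGE_SIZE);

		memcpy_to_pmem(addr, empty_zero_page, chunk);	/* movnt stores */
		addr += chunk;
		size -= chunk;
	}
}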
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-03 3:53 ` Dan Williams @ 2015-11-03 5:04 ` Dave Chinner 2015-11-04 0:50 ` Ross Zwisler 0 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2015-11-03 5:04 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, linux-nvdimm@lists.01.org, Brian Foster, xfs, linux-fsdevel, Ross Zwisler On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote: > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: > > [add people to the cc list] > > > > On Mon, Nov 02, 2015 at 09:15:10AM -0500, Brian Foster wrote: > >> On Mon, Nov 02, 2015 at 12:14:33PM +1100, Dave Chinner wrote: > >> > On Fri, Oct 30, 2015 at 08:36:57AM -0400, Brian Foster wrote: > >> > > Unless there is some > >> > > special mixed dio/mmap case I'm missing, doing so for DAX/DIO basically > >> > > causes a clear_pmem() over every page sized chunk of the target I/O > >> > > range for which we already have the data. > >> > > >> > I don't follow - this only zeros blocks when we do allocation of new > >> > blocks or overwrite unwritten extents, not on blocks which we > >> > already have written data extents allocated for... > >> > > >> > >> Why are we assuming that block zeroing is more efficient than unwritten > >> extents for DAX/dio? I haven't played with pmem enough to know for sure > >> one way or another (or if hw support is imminent), but I'd expect the > >> latter to be more efficient in general without any kind of hardware > >> support. > >> > >> Just as an example, here's an 8GB pwrite test, large buffer size, to XFS > >> on a ramdisk mounted with '-o dax:' > >> > >> - Before this series: > >> > >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file > >> wrote 8589934592/8589934592 bytes at offset 0 > >> 8.000 GiB, 820 ops; 0:00:04.00 (1.909 GiB/sec and 195.6591 ops/sec) > >> > >> - After this series: > >> > >> # xfs_io -fc "truncate 0" -c "pwrite -b 10m 0 8g" /mnt/file > >> wrote 8589934592/8589934592 bytes at offset 0 > >> 8.000 GiB, 820 ops; 0:00:12.00 (659.790 MiB/sec and 66.0435 ops/sec) > > > > That looks wrong. Much, much slower than it should be just zeroing > > pages and then writing to them again while cache hot. > > > > Oh, hell - dax_clear_blocks() is stupidly slow. A profile shows this > > loop spending most of the CPU time: > > > > ¿ ¿ jbe ea > > ¿ de: clflus %ds:(%rax) > > 84.67 ¿ add %rsi,%rax > > ¿ cmp %rax,%rdx > > ¿ ¿ ja de > > ¿ ea: add %r13,-0x38(%rbp) > > ¿ sub %r12,%r14 > > ¿ sub %r12,-0x40(%rbp) > > > > That is the overhead of __arch_wb_cache_pmem() i.e. issuing CPU > > cache flushes after each memset. > > Ideally this would be non-temporal and skip the second flush loop > altogether. Outside of that another problem is that this cpu does not > support the clwb instruction and is instead using the serializing and > invalidating clflush instruction. Sure, it can be optimised to improve behaviour, and other hardware will be more performant, but we'll still be doing synchronous cache flushing operations here and so the fundamental problem still exists. > > This comes back to the comments I made w.r.t. the pmem driver > > implementation doing synchronous IO by immediately forcing CPU cache > > flushes and barriers. it's obviously correct, but it looks like > > there's going to be a major performance penalty associated with it. > > This is why I recently suggested that a pmem driver that doesn't do > > CPU cache writeback during IO but does it on REQ_FLUSH is an > > architecture we'll likely have to support. 
> > > > The only thing we can realistically delay is wmb_pmem() i.e. the final > sync waiting for data that has *left* the cpu cache. Unless/until we > get a architecturally guaranteed method to write-back the entire > cache, or flush the cache by physical-cache-way we're stuck with > either non-temporal cycles or looping on potentially huge virtual > address ranges. I'm missing something: why won't flushing the address range returned by bdev_direct_access() during a fsync operation work? i.e. we're working with exactly the same address as dax_clear_blocks() and dax_do_io() use, so why can't we look up that address and flush it from fsync? > > This, however, is not really a problem for the filesystem - it's > > a pmem driver architecture problem. ;) > > > > It's a platform problem. Let's see how this looks when not using > clflush instructions. Yes, clflush vs clwb is that's a platform problem. However, the architectural problem I've refered to is that is we've designed the DAX infrastructure around a CPU implementation that requires synchronous memory flushes for persistence, and so driven that synchronous flush requirement into the DAX implementation itself.... > Also, another benefit of pushing zeroing down into the driver is that > for brd, as used in this example, it will rightly be a nop because > there's no persistence to guarantee there. Precisely my point - different drivers and hardware have different semantics and optimisations, and only by separating the data copying from the persistence model do we end up with infrastructure that is flexible enough to work with different hardware/driver pmem models... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 10+ messages in thread
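
A sketch of the fsync-time flush Dave is describing - hypothetical at this point; wb_cache_pmem() below stands for a cacheline-writeback primitive (a clwb/clflush range loop) that the pmem API would need to expose:

/* Hypothetical: write back a file range at fsync time rather than at IO time. */
static int dax_flush_sectors_sketch(struct block_device *bdev, sector_t sector,
				    long size)
{
	while (size > 0) {
		void __pmem *addr;
		unsigned long pfn;
		long len;

		len = bdev_direct_access(bdev, sector, &addr, &pfn, size);
		if (len < 0)
			return len;
		len = min(len, size);
		wb_cache_pmem(addr, len);	/* assumed clwb/clflush helper */
		sector += len >> 9;
		size -= len;
	}
	wmb_pmem();
	return 0;
}

->fsync would walk the dirty ranges of the file, map them back to pmem addresses exactly as dax_do_io() does, and write back only those lines.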
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-03 5:04 ` Dave Chinner @ 2015-11-04 0:50 ` Ross Zwisler 2015-11-04 1:02 ` Dan Williams 0 siblings, 1 reply; 10+ messages in thread From: Ross Zwisler @ 2015-11-04 0:50 UTC (permalink / raw) To: Dave Chinner Cc: Dan Williams, Brian Foster, Ross Zwisler, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote: > On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote: > > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: <> > > > This comes back to the comments I made w.r.t. the pmem driver > > > implementation doing synchronous IO by immediately forcing CPU cache > > > flushes and barriers. it's obviously correct, but it looks like > > > there's going to be a major performance penalty associated with it. > > > This is why I recently suggested that a pmem driver that doesn't do > > > CPU cache writeback during IO but does it on REQ_FLUSH is an > > > architecture we'll likely have to support. > > > > > > > The only thing we can realistically delay is wmb_pmem() i.e. the final > > sync waiting for data that has *left* the cpu cache. Unless/until we > > get a architecturally guaranteed method to write-back the entire > > cache, or flush the cache by physical-cache-way we're stuck with > > either non-temporal cycles or looping on potentially huge virtual > > address ranges. > > I'm missing something: why won't flushing the address range returned > by bdev_direct_access() during a fsync operation work? i.e. we're > working with exactly the same address as dax_clear_blocks() and > dax_do_io() use, so why can't we look up that address and flush it > from fsync? I could be wrong, but I don't see a reason why DAX can't use the strategy of writing data and marking it dirty in one step and then flushing later in response to fsync/msync. I think this could be used everywhere we write or zero data - dax_clear_blocks(), dax_io() etc. (I believe that lots of the block zeroing code will go away once we have the XFS and ext4 patches in that guarantee we will only get written and zeroed extents from the filesystem in response to get_block().) I think the PMEM driver, lacking the ability to mark things as dirty in the radix tree, etc, will need to keep doing things synchronously. Hmm...if we go this path, though, is that an argument against moving the zeroing from DAX down into the driver? True, with BRD it makes things nice and efficient because you can zero and never flush, and the driver knows there's nothing else to do. For PMEM, though, you lose the ability to zero the data and then queue the flushing for later, as you would be able to do if you left the zeroing code in DAX. The benefit of this is that if you are going to immediately re-write the newly zeroed data (which seems common), PMEM will end up doing an extra cache flush of the zeroes, only to have them overwritten and marked as dirty by DAX. If we leave the zeroing to DAX we can mark it dirty once, zero it once, write it once, and flush it once. This would make us lose the ability to do hardware-assisted flushing in the future that requires driver specific knowledge, though I don't think that exists yet. Perhaps we should leave the zeroing in DAX for now to take advantage of the single flush, and then move it down if a driver can improve performance with hardware assisted PMEM zeroing? ^ permalink raw reply [flat|nested] 10+ messages in thread
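
A minimal sketch of the "mark dirty now, flush at fsync/msync" bookkeeping Ross describes - hypothetical here, and it glosses over the hard part, namely that DAX has no struct page to hang the radix tree entry on:

/* On a DAX write/zero/fault: remember which file offset was dirtied.
 * Assumes an entry for this index has already been inserted (DAX has no
 * struct page here, which is the open problem). */
static void dax_mark_dirty_sketch(struct address_space *mapping, pgoff_t index)
{
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}

/*
 * Later, from ->fsync/->msync: walk the PAGECACHE_TAG_DIRTY entries, look
 * each block up again via bdev_direct_access(), write back those
 * cachelines, clear the tag, and finish with a single wmb_pmem().
 */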
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 0:50 ` Ross Zwisler @ 2015-11-04 1:02 ` Dan Williams 2015-11-04 4:46 ` Ross Zwisler 0 siblings, 1 reply; 10+ messages in thread From: Dan Williams @ 2015-11-04 1:02 UTC (permalink / raw) To: Ross Zwisler, Dave Chinner, Dan Williams, Brian Foster, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Tue, Nov 3, 2015 at 4:50 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote: >> On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote: >> > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: > <> >> > > This comes back to the comments I made w.r.t. the pmem driver >> > > implementation doing synchronous IO by immediately forcing CPU cache >> > > flushes and barriers. it's obviously correct, but it looks like >> > > there's going to be a major performance penalty associated with it. >> > > This is why I recently suggested that a pmem driver that doesn't do >> > > CPU cache writeback during IO but does it on REQ_FLUSH is an >> > > architecture we'll likely have to support. >> > > >> > >> > The only thing we can realistically delay is wmb_pmem() i.e. the final >> > sync waiting for data that has *left* the cpu cache. Unless/until we >> > get a architecturally guaranteed method to write-back the entire >> > cache, or flush the cache by physical-cache-way we're stuck with >> > either non-temporal cycles or looping on potentially huge virtual >> > address ranges. >> >> I'm missing something: why won't flushing the address range returned >> by bdev_direct_access() during a fsync operation work? i.e. we're >> working with exactly the same address as dax_clear_blocks() and >> dax_do_io() use, so why can't we look up that address and flush it >> from fsync? > > I could be wrong, but I don't see a reason why DAX can't use the strategy of > writing data and marking it dirty in one step and then flushing later in > response to fsync/msync. I think this could be used everywhere we write or > zero data - dax_clear_blocks(), dax_io() etc. (I believe that lots of the > block zeroing code will go away once we have the XFS and ext4 patches in that > guarantee we will only get written and zeroed extents from the filesystem in > response to get_block().) I think the PMEM driver, lacking the ability to > mark things as dirty in the radix tree, etc, will need to keep doing things > synchronously. Not without numbers showing the relative performance of dirtying cache followed by flushing vs non-temporal + pcommit. > Hmm...if we go this path, though, is that an argument against moving the > zeroing from DAX down into the driver? True, with BRD it makes things nice > and efficient because you can zero and never flush, and the driver knows > there's nothing else to do. > > For PMEM, though, you lose the ability to zero the data and then queue the > flushing for later, as you would be able to do if you left the zeroing code in > DAX. The benefit of this is that if you are going to immediately re-write the > newly zeroed data (which seems common), PMEM will end up doing an extra cache > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > it once, and flush it once. Why do we lose the ability to flush later if the driver supports blkdev_issue_zeroout? 
> This would make us lose the ability to do hardware-assisted flushing in the > future that requires driver specific knowledge, though I don't think that > exists yet. ioatdma has supported memset() for a while now, but I would prioritize a non-temporal SIMD implementation first. > Perhaps we should leave the zeroing in DAX for now to take > advantage of the single flush, and then move it down if a driver can improve > performance with hardware assisted PMEM zeroing? Not convinced. I think we should implement the driver zeroing solution and take a look at performance. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 1:02 ` Dan Williams @ 2015-11-04 4:46 ` Ross Zwisler 2015-11-04 9:06 ` Jan Kara 0 siblings, 1 reply; 10+ messages in thread From: Ross Zwisler @ 2015-11-04 4:46 UTC (permalink / raw) To: Dan Williams Cc: Ross Zwisler, Dave Chinner, Brian Foster, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote: > On Tue, Nov 3, 2015 at 4:50 PM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote: > >> On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote: > >> > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote: > > <> > >> > > This comes back to the comments I made w.r.t. the pmem driver > >> > > implementation doing synchronous IO by immediately forcing CPU cache > >> > > flushes and barriers. it's obviously correct, but it looks like > >> > > there's going to be a major performance penalty associated with it. > >> > > This is why I recently suggested that a pmem driver that doesn't do > >> > > CPU cache writeback during IO but does it on REQ_FLUSH is an > >> > > architecture we'll likely have to support. > >> > > > >> > > >> > The only thing we can realistically delay is wmb_pmem() i.e. the final > >> > sync waiting for data that has *left* the cpu cache. Unless/until we > >> > get a architecturally guaranteed method to write-back the entire > >> > cache, or flush the cache by physical-cache-way we're stuck with > >> > either non-temporal cycles or looping on potentially huge virtual > >> > address ranges. > >> > >> I'm missing something: why won't flushing the address range returned > >> by bdev_direct_access() during a fsync operation work? i.e. we're > >> working with exactly the same address as dax_clear_blocks() and > >> dax_do_io() use, so why can't we look up that address and flush it > >> from fsync? > > > > I could be wrong, but I don't see a reason why DAX can't use the strategy of > > writing data and marking it dirty in one step and then flushing later in > > response to fsync/msync. I think this could be used everywhere we write or > > zero data - dax_clear_blocks(), dax_io() etc. (I believe that lots of the > > block zeroing code will go away once we have the XFS and ext4 patches in that > > guarantee we will only get written and zeroed extents from the filesystem in > > response to get_block().) I think the PMEM driver, lacking the ability to > > mark things as dirty in the radix tree, etc, will need to keep doing things > > synchronously. > > Not without numbers showing the relative performance of dirtying cache > followed by flushing vs non-temporal + pcommit. Sorry - do you mean that you want to make sure that we get a performance benefit from the "dirty and flush later" path vs the "write and flush now" path? Sure, that seems reasonable. > > Hmm...if we go this path, though, is that an argument against moving the > > zeroing from DAX down into the driver? True, with BRD it makes things nice > > and efficient because you can zero and never flush, and the driver knows > > there's nothing else to do. > > > > For PMEM, though, you lose the ability to zero the data and then queue the > > flushing for later, as you would be able to do if you left the zeroing code in > > DAX. 
The benefit of this is that if you are going to immediately re-write the > > newly zeroed data (which seems common), PMEM will end up doing an extra cache > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > > it once, and flush it once. > > Why do we lose the ability to flush later if the driver supports > blkdev_issue_zeroout? I think that if you implement zeroing in the driver you'd need to also flush in the driver because you wouldn't have access to the radix tree to be able to mark entries as dirty so you can flush them later. As I think about this more, though, I'm not sure that having the zeroing flush later could work. I'm guessing that the filesystem must require a sync point between the zeroing and the subsequent follow-up writes so that you can sync metadata for the block allocation. Otherwise you could end up in a situation where you've got your metadata pointing at newly allocated blocks but the new zeros are still in the processor cache - if you lose power you've just created an information leak. Dave, Jan, does this make sense? > > This would make us lose the ability to do hardware-assisted flushing in the > > future that requires driver specific knowledge, though I don't think that > > exists yet. > > ioatdma has supported memset() for a while now, but I would prioritize > a non-temporal SIMD implementation first. Sweet, didn't know about that, obviously. :) Thanks for the pointer. > > Perhaps we should leave the zeroing in DAX for now to take > > advantage of the single flush, and then move it down if a driver can improve > > performance with hardware assisted PMEM zeroing? > > Not convinced. I think we should implement the driver zeroing > solution and take a look at performance. I agree, this should all be driven by performance measurements. Thanks for the feedback. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 4:46 ` Ross Zwisler @ 2015-11-04 9:06 ` Jan Kara 2015-11-04 15:35 ` Ross Zwisler 0 siblings, 1 reply; 10+ messages in thread From: Jan Kara @ 2015-11-04 9:06 UTC (permalink / raw) To: Ross Zwisler Cc: Dan Williams, Dave Chinner, Brian Foster, Jan Kara, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Tue 03-11-15 21:46:13, Ross Zwisler wrote: > On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote: > > > Hmm...if we go this path, though, is that an argument against moving the > > > zeroing from DAX down into the driver? True, with BRD it makes things nice > > > and efficient because you can zero and never flush, and the driver knows > > > there's nothing else to do. > > > > > > For PMEM, though, you lose the ability to zero the data and then queue the > > > flushing for later, as you would be able to do if you left the zeroing code in > > > DAX. The benefit of this is that if you are going to immediately re-write the > > > newly zeroed data (which seems common), PMEM will end up doing an extra cache > > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > > > it once, and flush it once. > > > > Why do we lose the ability to flush later if the driver supports > > blkdev_issue_zeroout? > > I think that if you implement zeroing in the driver you'd need to also > flush in the driver because you wouldn't have access to the radix tree to > be able to mark entries as dirty so you can flush them later. > > As I think about this more, though, I'm not sure that having the zeroing > flush later could work. I'm guessing that the filesystem must require a > sync point between the zeroing and the subsequent follow-up writes so > that you can sync metadata for the block allocation. Otherwise you could > end up in a situation where you've got your metadata pointing at newly > allocated blocks but the new zeros are still in the processor cache - if > you lose power you've just created an information leak. Dave, Jan, does > this make sense? So the problem you describe does not exist. Thing to keep in mind is that filesystem are designed to work reliably with 'non-persistent' cache in the disk which is common these days. That's why we bother with all that REQ_FLUSH | REQ_FUA and blkdev_issue_flush() stuff after all. Processor cache is exactly that kind of the cache attached to the PMEM storage. And Dave and I try to steer you to a solution that would also treat it equally in DAX filesystems as well :). Now how the problem is currently solved: When we allocate blocks, we just record that information in a transaction in the journal. For DAX case we also submit the IO zeroing those blocks and wait for it. Now if we crash before the transaction gets committed, blocks won't be seen in the inode after a journal recovery and thus no data exposure can happen. As a part of transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH request). We expect that to force out all the IO in volatile caches into the persistent storage. So this will also force the zeroing into persistent storage for normal disks and AFAIU if you do zeroing with non-temporal writes in pmem driver and then do wmb_pmem() in response to a flush request we get the same persistency guarantee in pmem case as well. So after a transaction commit we are guaranteed to see zeros in those allocated blocks. 
So the transaction commit, and the corresponding flush request in particular, is the sync point you speak about above, but the good thing is that in most cases this will happen after the real data gets written into those blocks, so we save the unnecessary flush. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
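
Condensing the ordering Jan describes into one sequence (a sketch, not literal jbd2/XFS code), with the driver's half of the contract underneath:

/*
 * 1. start transaction; allocate blocks; log the extent/bmap update
 * 2. zero the newly allocated blocks (DAX: clear_pmem()/non-temporal stores)
 * 3. commit the transaction:
 *      - write and flush the journal
 *      - issue REQ_FLUSH / blkdev_issue_flush()
 *        -> pmem makes everything written so far durable (wmb_pmem() etc.)
 *
 * Crash before step 3: recovery never replays the allocation, so the
 * not-yet-durable zeroes are never reachable - no stale data exposure.
 * Crash after step 3: the zeroes became durable no later than the
 * allocation did, so any block a reader can reach is already zeroed.
 */

/* Driver side of that contract, roughly (pmem bio path, simplified): */
static void pmem_handle_flush_sketch(struct bio *bio)
{
	if (bio->bi_rw & REQ_FLUSH)
		wmb_pmem();	/* plus any writeback a deferred model batched up */
}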
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 9:06 ` Jan Kara @ 2015-11-04 15:35 ` Ross Zwisler 2015-11-04 17:21 ` Jan Kara 0 siblings, 1 reply; 10+ messages in thread From: Ross Zwisler @ 2015-11-04 15:35 UTC (permalink / raw) To: Jan Kara Cc: Ross Zwisler, Dan Williams, Dave Chinner, Brian Foster, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Wed, Nov 04, 2015 at 10:06:12AM +0100, Jan Kara wrote: > On Tue 03-11-15 21:46:13, Ross Zwisler wrote: > > On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote: > > > > Hmm...if we go this path, though, is that an argument against moving the > > > > zeroing from DAX down into the driver? True, with BRD it makes things nice > > > > and efficient because you can zero and never flush, and the driver knows > > > > there's nothing else to do. > > > > > > > > For PMEM, though, you lose the ability to zero the data and then queue the > > > > flushing for later, as you would be able to do if you left the zeroing code in > > > > DAX. The benefit of this is that if you are going to immediately re-write the > > > > newly zeroed data (which seems common), PMEM will end up doing an extra cache > > > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > > > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > > > > it once, and flush it once. > > > > > > Why do we lose the ability to flush later if the driver supports > > > blkdev_issue_zeroout? > > > > I think that if you implement zeroing in the driver you'd need to also > > flush in the driver because you wouldn't have access to the radix tree to > > be able to mark entries as dirty so you can flush them later. > > > > As I think about this more, though, I'm not sure that having the zeroing > > flush later could work. I'm guessing that the filesystem must require a > > sync point between the zeroing and the subsequent follow-up writes so > > that you can sync metadata for the block allocation. Otherwise you could > > end up in a situation where you've got your metadata pointing at newly > > allocated blocks but the new zeros are still in the processor cache - if > > you lose power you've just created an information leak. Dave, Jan, does > > this make sense? > > So the problem you describe does not exist. Thing to keep in mind is that > filesystem are designed to work reliably with 'non-persistent' cache in the > disk which is common these days. That's why we bother with all that > REQ_FLUSH | REQ_FUA and blkdev_issue_flush() stuff after all. Processor > cache is exactly that kind of the cache attached to the PMEM storage. And > Dave and I try to steer you to a solution that would also treat it equally > in DAX filesystems as well :). And I'm always grateful for the guidance. :) > Now how the problem is currently solved: When we allocate blocks, we just > record that information in a transaction in the journal. For DAX case we > also submit the IO zeroing those blocks and wait for it. Now if we crash > before the transaction gets committed, blocks won't be seen in the inode > after a journal recovery and thus no data exposure can happen. As a part of > transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH > request). We expect that to force out all the IO in volatile caches into > the persistent storage. 
So this will also force the zeroing into persistent > storage for normal disks and AFAIU if you do zeroing with non-temporal > writes in pmem driver and then do wmb_pmem() in response to a flush request > we get the same persistency guarantee in pmem case as well. So after a > transaction commit we are guaranteed to see zeros in those allocated > blocks. > > So the transaction commit and the corresponding flush request in particular > is the sync point you speak about above but the good thing is that in most > cases this will happen after real data gets written into those blocks so we > save the unnecessary flush. Cool, thank you for the explanation, that makes sense to me. When dealing with normal SSDs and the page cache, does the filesystem keep the zeroes in the page cache, or does it issue it directly to the driver via sb_issue_zeroout()/blkdev_issue_zeroout()? If we keep it in the page cache so that the follow-up writes just update the dirty pages and we end up writing to media once, this seems like it would flow nicely into the idea of zeroing new blocks at the DAX level without flushing and just marking them as dirty in the radix tree. If the zeroing happens via sb_issue_zeroout() then this probably doesn't make sense because the existing flow won't include a fsync/msync type step of the newly zeroed data in the page cache. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-04 15:35 ` Ross Zwisler @ 2015-11-04 17:21 ` Jan Kara 0 siblings, 0 replies; 10+ messages in thread From: Jan Kara @ 2015-11-04 17:21 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, Dan Williams, Dave Chinner, Brian Foster, xfs, linux-fsdevel, linux-nvdimm@lists.01.org On Wed 04-11-15 08:35:51, Ross Zwisler wrote: > On Wed, Nov 04, 2015 at 10:06:12AM +0100, Jan Kara wrote: > > Now how the problem is currently solved: When we allocate blocks, we just > > record that information in a transaction in the journal. For DAX case we > > also submit the IO zeroing those blocks and wait for it. Now if we crash > > before the transaction gets committed, blocks won't be seen in the inode > > after a journal recovery and thus no data exposure can happen. As a part of > > transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH > > request). We expect that to force out all the IO in volatile caches into > > the persistent storage. So this will also force the zeroing into persistent > > storage for normal disks and AFAIU if you do zeroing with non-temporal > > writes in pmem driver and then do wmb_pmem() in response to a flush request > > we get the same persistency guarantee in pmem case as well. So after a > > transaction commit we are guaranteed to see zeros in those allocated > > blocks. > > > > So the transaction commit and the corresponding flush request in particular > > is the sync point you speak about above but the good thing is that in most > > cases this will happen after real data gets written into those blocks so we > > save the unnecessary flush. > > Cool, thank you for the explanation, that makes sense to me. > > When dealing with normal SSDs and the page cache, does the filesystem keep the > zeroes in the page cache, or does it issue it directly to the driver via > sb_issue_zeroout()/blkdev_issue_zeroout()? If we keep it in the page cache so Currently we use sb_issue_zeroout() for zeroing out blocks in ext4 (for DAX and in some other cases to avoid splitting extents too much). Note that zeroing blocks in page cache won't work on it's own - blkdev_issue_flush() doesn't force out any dirty pages from page cache so transaction commit wouldn't make sure stale block contents is not exposed. However we actually do have a mechanism in ext4 to force out data from page cache during transaction commit - that is what data=ordered mode is all about - we can attach inode to a transaction and all dirty data of that inode that has underlying blocks allocated is flushed as a part of transaction commit. So this makes sure stale data cannot be seen after a crash in data=ordered mode. XFS doesn't have this mechanism and thus it normally has to put allocated blocks in unwritten extents, submit IO to write data to those blocks, and convert extent to written once IO finishes. > that the follow-up writes just update the dirty pages and we end up > writing to media once, this seems like it would flow nicely into the idea > of zeroing new blocks at the DAX level without flushing and just marking > them as dirty in the radix tree. So for ext4 you could make this work the same way data=ordered mode works - just store zeroes into allocated blocks, mark page as dirty in the radix tree, attach inode to the running transaction, and on transaction commit do the flush. 
However note that for DAX page faults zeroing needs to happen under fs lock serializing faults to the same page which effectively means it happens inside filesystem's block mapping function and at that place we don't have a page to zero out. So it would require considerable plumbing to make this work even for ext4 and I'm not even speaking of XFS where the flushing mechanism on transaction commit doesn't currently exist AFAIK. Frankly since the savings would be realized only for allocating writes into mmaped file which aren't than common, I'm not convinced this would be worth it. > If the zeroing happens via sb_issue_zeroout() then this probably doesn't > make sense because the existing flow won't include a fsync/msync type > step of the newly zeroed data in the page cache. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
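
For concreteness, the ext4 zeroing path Jan refers to boils down to a call of this shape (simplified - the real code sits in the extent-conversion paths around ext4_ext_zeroout()):

/* Simplified: ext4 zeroes allocated blocks through the block layer rather
 * than through a DAX-special path; fs-block to sector conversion happens
 * inside sb_issue_zeroout(). */
static int ext4_zeroout_sketch(struct inode *inode, ext4_fsblk_t pblk,
			       unsigned int len)
{
	return sb_issue_zeroout(inode->i_sb, pblk, len, GFP_NOFS);
}

This is why making blkdev_issue_zeroout() fast on pmem helps ext4 (and any other caller) without any DAX-specific plumbing.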
* Re: [PATCH 3/6] xfs: Don't use unwritten extents for DAX 2015-11-02 21:44 ` [PATCH 3/6] xfs: Don't use unwritten extents for DAX Dave Chinner 2015-11-03 3:53 ` Dan Williams @ 2015-11-03 9:16 ` Jan Kara 1 sibling, 0 replies; 10+ messages in thread From: Jan Kara @ 2015-11-03 9:16 UTC (permalink / raw) To: Dave Chinner Cc: jack, linux-nvdimm, Brian Foster, xfs, linux-fsdevel, ross.zwisler, dan.j.williams On Tue 03-11-15 08:44:24, Dave Chinner wrote: > Realistically, dax_clear_blocks() should probably be implemented at > the pmem driver layer through blkdev_issue_zeroout() because all it > does is directly map the sector/len to pfn via bdev_direct_access() > and then zero it - it's a sector based, block device operation. We > don't actually need a special case path for DAX here. Optimisation > of this operation has little to do with the filesystem. Yep. This is actually what I did in ext4 - the block zeroing is using the ext4 block zeroout path which ends up calling blkdev_issue_zeroout(). I didn't want to special-case DAX and figured out that if we want performance, we should implement blkdev_issue_zeroout() efficiently for pmem. After all ext4 uses blkdev_issue_zeroout() in other extent conversion cases where zeroing out is needed. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 10+ messages in thread