* xfs over pmem - cp performance
From: Elliott, Robert (Persistent Memory) @ 2016-01-08 21:07 UTC (permalink / raw)
To: david@fromorbit.com
Cc: linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org
I tried using cp to copy the linux git tree between
pmem devices like this:
cp -r /mnt/xfs-pmem1/linux /mnt/xfs-pmem2
The time taken by various filesystems varies (4.4-rc5):
* xfs w/dax: 42 s
* xfs no dax: 14 s
* ext4 w/dax: 7 s
* ext4 no dax: 15 s
* btrfs no dax: 18 s
mount options:
* /dev/pmem1 on /mnt/xfs-pmem1 type xfs (rw,relatime,seclabel,attr2,dax,inode64,noquota)
* /dev/pmem1 on /mnt/ext4-pmem1 type ext4 (rw,relatime,seclabel,dax,data=ordered)
* /dev/pmem1 on /mnt/btrfs-pmem1 type btrfs (rw,relatime,seclabel,ssd,space_cache,subvolid=5,subvol=/)
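For reference, a minimal sketch of the timing harness behind the numbers above (the `timed_copy` helper is illustrative, not the exact commands used; the device and mount-point names follow the mounts listed above):

```shell
#!/bin/sh
# Minimal timing harness for the copies above. The mkfs/mount steps
# need root and a DAX-capable kernel; run them once per device, e.g.:
#   mkfs.xfs -f /dev/pmem1 && mount -o dax /dev/pmem1 /mnt/xfs-pmem1
#   mkfs.xfs -f /dev/pmem2 && mount -o dax /dev/pmem2 /mnt/xfs-pmem2

# timed_copy SRC DST: recursively copy SRC into DST, report elapsed seconds
timed_copy() {
    src=$1; dst=$2
    start=$(date +%s)
    cp -r "$src" "$dst"
    echo "copied $src -> $dst in $(( $(date +%s) - start )) s"
}

# e.g.: timed_copy /mnt/xfs-pmem1/linux /mnt/xfs-pmem2
```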
xfs with dax spends most of the time in clear_page_c_e and
dax_clear_blocks (from "perf top"):
  30.06%  [kernel]  [k] clear_page_c_e
  12.24%  [kernel]  [k] dax_clear_blocks
   5.36%  [kernel]  [k] copy_user_enhanced_fast_string
   4.33%  [kernel]  [k] __copy_user_nocache
   2.55%  [xfs]     [k] xfs_perag_put
   1.77%  [kernel]  [k] security_compute_sid.part.12
   1.19%  [kernel]  [k] __percpu_counter_sum
   1.14%  [kernel]  [k] acpi_os_write_port
   1.03%  [kernel]  [k] dax_do_io
   1.00%  [kernel]  [k] _raw_spin_lock
The others spend most of their time in the
copy_user_enhanced_fast_string and __copy_user_nocache
functions that actually copy data.
xfs without dax:
  28.82%  [kernel]  [k] copy_user_enhanced_fast_string
   7.48%  [kernel]  [k] __copy_user_nocache
   3.63%  [kernel]  [k] __block_commit_write.isra.22
   1.86%  [kernel]  [k] acpi_os_write_port
   1.72%  [kernel]  [k] filenametr_cmp
   1.48%  [kernel]  [k] hashtab_search
   1.28%  [kernel]  [k] security_compute_sid.part.12
   0.96%  [kernel]  [k] _raw_spin_lock
ext4 with dax:
  22.85%  [kernel]  [k] __copy_user_nocache
  22.51%  [kernel]  [k] copy_user_enhanced_fast_string
   4.15%  [kernel]  [k] mb_find_order_for_block
   3.03%  [kernel]  [k] dax_do_io
   2.08%  [kernel]  [k] __d_lookup_rcu
   1.85%  [kernel]  [k] mb_find_extent
   1.75%  [kernel]  [k] ext4_mark_iloc_dirty
   1.54%  [kernel]  [k] acpi_os_write_port
   1.15%  [kernel]  [k] _find_next_bit.part.0
   0.99%  [kernel]  [k] ext4_mb_good_group
ext4 without dax:
  29.89%  [kernel]  [k] copy_user_enhanced_fast_string
  15.81%  [kernel]  [k] __copy_user_nocache
   4.45%  [kernel]  [k] __block_commit_write.isra.22
   1.39%  [kernel]  [k] ext4_mark_iloc_dirty
   1.37%  [kernel]  [k] ext4_bio_write_page
   1.12%  [kernel]  [k] filenametr_cmp
   1.09%  [kernel]  [k] security_compute_sid.part.12
   0.98%  [kernel]  [k] hashtab_search
btrfs (without dax):
  14.25%  [kernel]  [k] copy_user_enhanced_fast_string
  14.12%  [kernel]  [k] queued_spin_lock_slowpath
   9.70%  [kernel]  [k] __copy_user_nocache
   3.48%  [kernel]  [k] acpi_os_write_port
   1.52%  [kernel]  [k] _raw_spin_lock
   1.38%  [kernel]  [k] queued_write_lock_slowpath
   1.36%  [kernel]  [k] _raw_spin_lock_irqsave
---
Robert Elliott, HPE Persistent Memory
* Re: xfs over pmem - cp performance
From: Dave Chinner @ 2016-01-08 22:03 UTC (permalink / raw)
To: Elliott, Robert (Persistent Memory)
Cc: linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org
On Fri, Jan 08, 2016 at 09:07:27PM +0000, Elliott, Robert (Persistent Memory) wrote:
> I tried using cp to copy the linux git tree between
> pmem devices like this:
> cp -r /mnt/xfs-pmem1/linux /mnt/xfs-pmem2
>
> The time taken by various filesystems varies (4.4-rc5):
> * xfs w/dax: 42 s
> * xfs no dax: 14 s
> * ext4 w/dax: 7 s
> * ext4 no dax: 15 s
> * btrfs no dax: 18 s
Yes, we know.
> mount options:
> * /dev/pmem1 on /mnt/xfs-pmem1 type xfs (rw,relatime,seclabel,attr2,dax,inode64,noquota)
> * /dev/pmem1 on /mnt/ext4-pmem1 type ext4 (rw,relatime,seclabel,dax,data=ordered)
> * /dev/pmem1 on /mnt/btrfs-pmem1 type btrfs (rw,relatime,seclabel,ssd,space_cache,subvolid=5,subvol=/)
>
> xfs with dax spends most of the time in clear_page_c_e and
> dax_clear_blocks (from "perf top"):
> 30.06% [kernel] [k] clear_page_c_e
> 12.24% [kernel] [k] dax_clear_blocks
That's where the difference is: XFS is zeroing the blocks during
allocation so that we know a failed write or a crash during a
write will not expose stale data to the user. I've commented on
this previously here:
http://oss.sgi.com/archives/xfs/2015-11/msg00021.html
and it's a result of the current "everything is synchronous" DAX cpu
cache control behaviour.
I think it's worth noting that ext4 is not spending any time
zeroing the blocks during allocation, which I think means that it
can expose stale data as a result of a crash or partial write....
We're working on fixing this, but it needs all the fsync patches
from Ross to enable us to turn off the synchronous cache flushes
in the DAX IO code.
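The cost of that synchronous behaviour can be felt even without pmem hardware; the following is only a rough userspace analogy (my construction, not the kernel code path), using dd's dsync flag to force each block of a zeroing write to stable storage:

```shell
#!/bin/sh
# Userspace analogy only (not the kernel code path): force every block of
# a zeroing write to stable storage (oflag=dsync), versus letting the page
# cache absorb it -- loosely the difference between the synchronous
# dax_clear_blocks() zeroing above and ordinary buffered allocation.

zero_file() {   # zero_file PATH MiB [extra dd flags]
    path=$1; mib=$2; flags=$3
    start=$(date +%s)
    dd if=/dev/zero of="$path" bs=1M count="$mib" $flags 2>/dev/null
    echo "$path: $(( $(date +%s) - start )) s"
}

zero_file /tmp/zero-buffered.img 8               # cached writes
zero_file /tmp/zero-dsync.img    8 oflag=dsync   # synchronous writes
```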
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs over pmem - cp performance
From: Ross Zwisler @ 2016-01-12 17:31 UTC (permalink / raw)
To: Dave Chinner
Cc: Elliott, Robert (Persistent Memory),
linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org,
Jan Kara
On Sat, Jan 09, 2016 at 09:03:28AM +1100, Dave Chinner wrote:
> On Fri, Jan 08, 2016 at 09:07:27PM +0000, Elliott, Robert (Persistent Memory) wrote:
> > I tried using cp to copy the linux git tree between
> > pmem devices like this:
> > cp -r /mnt/xfs-pmem1/linux /mnt/xfs-pmem2
> >
> > The time taken by various filesystems varies (4.4-rc5):
> > * xfs w/dax: 42 s
> > * xfs no dax: 14 s
> > * ext4 w/dax: 7 s
> > * ext4 no dax: 15 s
> > * btrfs no dax: 18 s
>
> Yes, we know.
>
> > mount options:
> > * /dev/pmem1 on /mnt/xfs-pmem1 type xfs (rw,relatime,seclabel,attr2,dax,inode64,noquota)
> > * /dev/pmem1 on /mnt/ext4-pmem1 type ext4 (rw,relatime,seclabel,dax,data=ordered)
> > * /dev/pmem1 on /mnt/btrfs-pmem1 type btrfs (rw,relatime,seclabel,ssd,space_cache,subvolid=5,subvol=/)
> >
> > xfs with dax spends most of the time in clear_page_c_e and
> > dax_clear_blocks (from "perf top"):
> > 30.06% [kernel] [k] clear_page_c_e
> > 12.24% [kernel] [k] dax_clear_blocks
>
> That's where the difference is - XFS is zeroing the blocks during
> allocation so that we know that a failed write or crash during a
> write will not expose stale data to the user. I've made comment
> about this previously here:
>
> http://oss.sgi.com/archives/xfs/2015-11/msg00021.html
>
> and it's a result of the current "everything is synchronous" DAX cpu
> cache control behaviour.
>
> I think it's worth noting that ext4 is not spending any time
> zeroing the blocks during allocation, which I think means that it
> can expose stale data as a result of a crash or partial write....
Jan's patch series that does the zeroing for newly allocated blocks in ext4
hasn't been merged yet, and is queued for v4.5 inclusion:
https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/log/
My guess is that once this set is included, the ext4 overhead for block zeroing
will go up. If you're testing v4.4 code, the zeroing for newly allocated
blocks with ext4 is still happening inside of DAX.
> We're working on fixing this, but it needs all the fsync patches
> from Ross to enable us to turn off the synchronous cache flushes
> in the DAX IO code.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com