* Slow deduplication
From: Steinar H. Gunderson @ 2025-03-02 8:47 UTC (permalink / raw)
To: linux-xfs
Hi,
I'm investigating XFS block-level deduplication via reflink (FIDEDUPERANGE),
and I'm trying to figure out some performance problems I've got. I have a
fresh filesystem of about 4–8 TB (made with mkfs.xfs 6.1.0) that I copied
data into a few days ago, and I'm running 6.13.0-rc4 (since that was the most
recent when I last had the chance to boot; I believe I've seen this before
with older kernels, so I don't think this is a regression).
The underlying block device is an LVM volume on top of a RAID-6, and when
I read sequentially from large files, it gives me roughly 1.1 GB/sec
(although not completely evenly). My deduplication code works in mostly
the obvious way, in that it first reads files, hashes blocks from them,
then figures out (through some algorithms that are not important here) what
file ranges should be deduplicated. And the latter part is slow; almost so
slow as to be unusable.
For instance, I have 13 files of about 10 GB each that happen to be identical
save for the first 20 kB. My program has identified this, and calls
ioctl(FIDEDUPERANGE) with one of the files as source and the other 12
as destinations, in consecutive 16 MB chunks (since that's what
ioctl_fideduprange(2) recommends; I also tried simply a single 10 GB call
earlier, but it was no faster and also stopped after the first gigabyte);
strace gives:
ioctl(637, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
{src_offset=4294971392, src_length=16777216, dest_count=12,
info=[{dest_fd=638, dest_offset=4294971392},
{dest_fd=639, dest_offset=4294971392},
{dest_fd=640, dest_offset=4294971392},
{dest_fd=641, dest_offset=4294971392},
{dest_fd=642, dest_offset=4294971392},
{dest_fd=643, dest_offset=4294971392},
{dest_fd=644, dest_offset=4294971392},
{dest_fd=645, dest_offset=4294971392},
{dest_fd=646, dest_offset=4294971392},
{dest_fd=647, dest_offset=4294971392},
{dest_fd=648, dest_offset=4294971392},
{dest_fd=649, dest_offset=4294971392}]}
This ioctl call successfully deduplicated the data, but it took 71.52 _seconds_.
Deduplicating the entire set is on the order of days. I don't understand why
this would take so much time; I understand that it needs to make a read to
verify that the file ranges are indeed the same (this is the only sane API
design!), but it comes out to something like 2800 kB/sec from an array that
can deliver almost 400 times that. There is no other activity on the file
system in question, so it should not conflict with other activity (locks
etc.), and the process does not appear to be taking significant amounts of
CPU time. iostat shows read activity varying from maybe 300 kB/sec to
12000 kB/sec or so; /proc/<pid>/stack says:
[<0>] folio_wait_bit_common+0x174/0x220
[<0>] filemap_read_folio+0x64/0x8b
[<0>] do_read_cache_folio+0x119/0x164
[<0>] __generic_remap_file_range_prep+0x372/0x568
[<0>] generic_remap_file_range_prep+0x7/0xd
[<0>] xfs_reflink_remap_prep+0xb7/0x223 [xfs]
[<0>] xfs_file_remap_range+0x94/0x248 [xfs]
[<0>] vfs_dedupe_file_range_one+0x145/0x181
[<0>] vfs_dedupe_file_range+0x14d/0x1ca
[<0>] do_vfs_ioctl+0x483/0x8a4
[<0>] __do_sys_ioctl+0x51/0x83
[<0>] do_syscall_64+0x76/0xd8
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Is there anything I can do to speed this up? Is there simply some sort of
bug that causes it to be so slow?
/* Steinar */
--
Homepage: https://www.sesse.net/
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Slow deduplication
From: Dave Chinner @ 2025-03-02 21:35 UTC (permalink / raw)
To: Steinar H. Gunderson; +Cc: linux-xfs
On Sun, Mar 02, 2025 at 09:47:10AM +0100, Steinar H. Gunderson wrote:
> This ioctl call successfully deduplicated the data, but it took 71.52 _seconds_.
> Deduplicating the entire set is on the order of days. I don't understand why
> this would take so much time; I understand that it needs to make a read to
> verify that the file ranges are indeed the same (this is the only sane API
> design!), but it comes out to something like 2800 kB/sec from an array that
> can deliver almost 400 times that. There is no other activity on the file
> system in question, so it should not conflict with other activity (locks
> etc.), and the process does not appear to be taking significant amounts of
> CPU time. iostat shows read activity varying from maybe 300 kB/sec to
> 12000 kB/sec or so; /proc/<pid>/stack says:
>
> [<0>] folio_wait_bit_common+0x174/0x220
> [<0>] filemap_read_folio+0x64/0x8b
> [<0>] do_read_cache_folio+0x119/0x164
> [<0>] __generic_remap_file_range_prep+0x372/0x568
> [<0>] generic_remap_file_range_prep+0x7/0xd
This does comparison one folio at a time and does no readahead.
Hence if the data isn't already in cache, it is doing synchronous
small reads and waiting for every single one of them. This really
should use an internal interface that is capable of issuing
readahead...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Slow deduplication
From: Steinar H. Gunderson @ 2025-03-02 21:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Mon, Mar 03, 2025 at 08:35:57AM +1100, Dave Chinner wrote:
> This does comparison one folio at a time and does no readahead.
> Hence if the data isn't already in cache, it is doing synchronous
> small reads and waiting for every single one of them. This really
> should use an internal interface that is capable of issuing
> readahead...
Yes, I noticed that if I do a dummy read() of each extent first,
it becomes _massively_ faster. I'm not sure if I trust posix_fadvise()
to actually do readahead for POSIX_FADV_WILLNEED given the manpage;
would it work (and give roughly the same readahead that read() seems
to be doing)?
After 12 hours or so of this massive I/O, the page cache seemingly
fragments badly and I'm left spending 99% of CPU time in xas_* functions
(on read()) until I do drop_caches, at which point it clears up again.
I'm not sure whether this is deduplication-related or not. :-)
/* Steinar */
--
Homepage: https://www.sesse.net/
* Re: Slow deduplication
From: Christoph Hellwig @ 2025-03-03 14:03 UTC (permalink / raw)
To: Steinar H. Gunderson; +Cc: Dave Chinner, linux-xfs
On Sun, Mar 02, 2025 at 10:49:33PM +0100, Steinar H. Gunderson wrote:
> On Mon, Mar 03, 2025 at 08:35:57AM +1100, Dave Chinner wrote:
> > This does comparison one folio at a time and does no readahead.
> > Hence if the data isn't already in cache, it is doing synchronous
> > small reads and waiting for every single one of them. This really
> > should use an internal interface that is capable of issuing
> > readahead...
>
> Yes, I noticed that if I do dummy read() of each extent first,
> it becomes _massively_ faster. I'm not sure if I trust posix_fadvise()
> to actually do readahead for POSIX_FADV_WILLNEED given the manpage;
> would it work (and give
> roughly the same readahead that read() seems to be doing)?
The right thing to do is to just issue readahead in
vfs_dedupe_file_range_compare. The ractl structure is a bit odd, so
it'll need slightly more careful thought than just a hacked-up
one-liner, but it should still be relatively simple. I can look into
it once I find a little time, if no one beats me to it.
* Re: Slow deduplication
From: Christoph Hellwig @ 2025-03-06 0:35 UTC (permalink / raw)
To: Steinar H. Gunderson; +Cc: Dave Chinner, linux-xfs
On Mon, Mar 03, 2025 at 06:03:07AM -0800, Christoph Hellwig wrote:
> The right thing to do is to just issue readahead in
> vfs_dedupe_file_range_compare. The ractl structure is a bit odd, so
> it'll need slightly more careful thought than just a hacked-up
> one-liner, but it should still be relatively simple. I can look into
> it once I find a little time, if no one beats me to it.
I gave it a quick try yesterday, but it turns out XFS holds the
invalidate_lock over dedup, and the readahead code also wants to
take it, so the simple use of the readahead code doesn't work.
But as dedup only needs a tiny subset of the readahead algorithm
it might be possible to simply open code it. I'll see what I can
do when I find a little more time for it.
* Re: Slow deduplication
From: Steinar H. Gunderson @ 2025-03-06 8:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs
On Wed, Mar 05, 2025 at 04:35:34PM -0800, Christoph Hellwig wrote:
> I gave it a quick try yesterday, but it turns out XFS holds the
> invalidate_lock over dedup, and the readahead code also wants to
> take it, so the simple use of the readahead code doesn't work.
> But as dedup only needs a tiny subset of the readahead algorithm
> it might be possible to simply open code it. I'll see what I can
> do when I find a little more time for it.
Thanks for looking into this. I figured I had to always do read()
first to support older kernels anyway, but I guess it would be good
to get this fixed for the point where today's kernels have become
older kernels :-) (Perhaps this hasn't been noticed before since
most deduplication software does its own read() and comparison
before calling the ioctl? Just a guess.)
/* Steinar */
--
Homepage: https://www.sesse.net/