public inbox for linux-xfs@vger.kernel.org
* reordering file operations for performance
@ 2011-01-31  4:47 Phil Karn
  2011-01-31  5:54 ` Dave Chinner
  0 siblings, 1 reply; 2+ messages in thread
From: Phil Karn @ 2011-01-31  4:47 UTC (permalink / raw)
  To: xfs


I have written a file deduplicator, dupmerge, that walks through a file
system (or reads a list of files from stdin), sorts the files by size, and
compares each pair of files of the same size looking for duplicates. When
it finds two distinct files with identical contents on the same file
system, it deletes the newer copy and recreates its path name as a hard
link to the older version.

For performance it compares SHA1 hashes rather than the actual file
contents. To avoid unnecessary full-file reads, it first compares the
hashes of the first page (4 KiB) of each file; only if those match does it
compute and compare the full-file hashes. Each file is fully read at most
once, and sequentially, so if the file occupies a single extent it can be
read in a single large contiguous transfer. This is noticeably faster than
a direct compare, which seeks back and forth between two files that may sit
at opposite ends of the disk.
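
In outline, the two-stage check looks like the sketch below. FNV-1a stands
in for SHA1 purely so the example compiles without a crypto library, and
the helper names are made up for illustration; dupmerge itself uses SHA1.

```c
/* Two-stage duplicate check: hash the first 4 KiB of each file first,
 * and fall back to a full-file hash only when the cheap hashes match.
 * FNV-1a stands in for SHA1 so the sketch needs no crypto library. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FIRST_PAGE 4096

static uint64_t fnv1a(const unsigned char *buf, size_t len, uint64_t h)
{
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;          /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hash at most 'limit' bytes of the file (0 = whole file).
 * Returns 0 on success, -1 if the file cannot be opened. */
static int hash_file(const char *path, size_t limit, uint64_t *out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    unsigned char buf[FIRST_PAGE];
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    size_t total = 0, n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        if (limit && total + n > limit)
            n = limit - total;              /* clamp to the limit */
        h = fnv1a(buf, n, h);
        total += n;
        if (limit && total >= limit)
            break;
    }
    fclose(f);
    *out = h;
    return 0;
}

/* Returns 1 if the files hash identically, 0 if not, -1 on error. */
int probably_duplicates(const char *a, const char *b)
{
    uint64_t ha, hb;
    if (hash_file(a, FIRST_PAGE, &ha) || hash_file(b, FIRST_PAGE, &hb))
        return -1;
    if (ha != hb)
        return 0;       /* cheap first-page hashes already differ */
    if (hash_file(a, 0, &ha) || hash_file(b, 0, &hb))
        return -1;
    return ha == hb;    /* full-file hashes decide */
}
```

(In the real program you would also skip the second stage when both files
fit entirely in the first page, since the full hash can add nothing new.)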

I am looking for additional performance enhancements, and I don't mind
using fs-specific features. For example, I am now stashing the file hashes
in xfs extended attributes.
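
The caching amounts to something like this sketch; the attribute name
"user.dupmerge.sha1" is made up for the example, not dupmerge's actual
naming scheme:

```c
/* Cache a file's content hash in a user.* extended attribute so later
 * runs can skip rehashing unchanged files. The attribute name here is
 * invented for illustration. */
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

#define HASH_ATTR "user.dupmerge.sha1"

/* Store the hex-encoded hash; returns 0 on success, -1 on error. */
int stash_hash(const char *path, const char *hex)
{
    return setxattr(path, HASH_ATTR, hex, strlen(hex), 0);
}

/* Fetch a previously stored hash into buf; returns its length or -1. */
ssize_t fetch_hash(const char *path, char *buf, size_t buflen)
{
    ssize_t n = getxattr(path, HASH_ATTR, buf, buflen - 1);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

A cached hash goes stale when the file changes, so in practice you would
record the file's size and mtime alongside it and ignore the attribute
when they no longer match.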

I regularly run xfs_fsr and have added fallocate() calls to the major file
copy utilities, so all of my files are in single extents. Is there an easy
way to ask xfs where those extents are located so that I could sort a set of
files by location and then access them in a more efficient order?

I know that there's more to reading a file than accessing its data extents.
But by the time I'm comparing files I have already lstat()'ed them all so
their inodes and directory paths are probably all still in the cache.

Thanks,
Phil


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: reordering file operations for performance
  2011-01-31  4:47 reordering file operations for performance Phil Karn
@ 2011-01-31  5:54 ` Dave Chinner
  0 siblings, 0 replies; 2+ messages in thread
From: Dave Chinner @ 2011-01-31  5:54 UTC (permalink / raw)
  To: karn; +Cc: xfs

On Sun, Jan 30, 2011 at 08:47:03PM -0800, Phil Karn wrote:
> I have written a file deduplicator, dupmerge, that walks through a file
> system (or reads a list of files from stdin), sorts the files by size, and
> compares each pair of files of the same size looking for duplicates. When
> it finds two distinct files with identical contents on the same file
> system, it deletes the newer copy and recreates its path name as a hard
> link to the older version.
> 
> For performance it compares SHA1 hashes rather than the actual file
> contents. To avoid unnecessary full-file reads, it first compares the
> hashes of the first page (4 KiB) of each file; only if those match does it
> compute and compare the full-file hashes. Each file is fully read at most
> once, and sequentially, so if the file occupies a single extent it can be
> read in a single large contiguous transfer. This is noticeably faster than
> a direct compare, which seeks back and forth between two files that may sit
> at opposite ends of the disk.
> 
> I am looking for additional performance enhancements, and I don't mind
> using fs-specific features. For example, I am now stashing the file hashes
> in xfs extended attributes.
> 
> I regularly run xfs_fsr and have added fallocate() calls to the major file
> copy utilities, so all of my files are in single extents. Is there an easy
> way to ask xfs where those extents are located so that I could sort a set of
> files by location and then access them in a more efficient order?

ioctl(FS_IOC_FIEMAP) is what you want.
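
Roughly like this for the first extent of a file (a sketch only;
first_extent_offset is a name made up for the example):

```c
/* Query the physical location of a file's first extent via FIEMAP.
 * Sorting files by this offset approximates on-disk order. Linux-only. */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the physical byte offset of the file's first extent,
 * or -1 if FIEMAP is unavailable or the file has no extents. */
long long first_extent_offset(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Room for the header plus a single extent record. */
    struct fiemap *fm = calloc(1, sizeof(*fm) +
                                  sizeof(struct fiemap_extent));
    if (!fm) {
        close(fd);
        return -1;
    }
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;           /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC; /* flush delayed allocation first */
    fm->fm_extent_count = 1;         /* only the first extent is needed */

    long long off = -1;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        off = (long long)fm->fm_extents[0].fe_physical;
    free(fm);
    close(fd);
    return off;
}
```

Sort your candidate files by that offset before reading them. Filesystems
that don't implement FIEMAP fail the ioctl (EOPNOTSUPP), so keep a
fallback ordering such as inode number.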

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

