* Can I get a checksum for a file from btrfs (without reading the whole file)?
@ 2015-02-05 10:40 Lutz Vieweg
  2015-02-06  5:20 ` Qu Wenruo
  0 siblings, 1 reply; 4+ messages in thread
From: Lutz Vieweg @ 2015-02-05 10:40 UTC (permalink / raw)
  To: linux-btrfs

Hi,

use case: You have two huge files on a btrfs, you assume they contain the same bytes,
but you do not know for sure.

Is there a way to get a checksum of both files from btrfs with less effort than
reading the whole of both files and computing a hash sum?

(I was thinking that the btrfs-internal CRCs might be of use, here...)

Regards,

Lutz Vieweg



* Re: Can I get a checksum for a file from btrfs (without reading the whole file)?
  2015-02-05 10:40 Can I get a checksum for a file from btrfs (without reading the whole file)? Lutz Vieweg
@ 2015-02-06  5:20 ` Qu Wenruo
  2015-02-06 13:00   ` Lutz Vieweg
  0 siblings, 1 reply; 4+ messages in thread
From: Qu Wenruo @ 2015-02-06  5:20 UTC (permalink / raw)
  To: Lutz Vieweg, linux-btrfs


-------- Original Message --------
Subject: Can I get a checksum for a file from btrfs (without reading the 
whole file)?
From: Lutz Vieweg <lvml@5t9.de>
To: <linux-btrfs@vger.kernel.org>
Date: 2015-02-05 18:40
> Hi,
>
> use case: You have two huge files on a btrfs, you assume they contain 
> the same bytes,
> but you do not know for sure.
>
> Is there a way to get a checksum of both files from btrfs with less 
> effort than
> reading the whole of both files and computing a hash sum?
In short: NO.

The long answer:
In the current implementation, btrfs computes a 4-byte (32-bit) crc32 for
each 4K sector and stores it in the csum tree.

So for a large file, e.g. 1G (already quite small for modern storage),
the checksums add up to 1M in size.
That means that even using crc32 (same as the kernel, and the crc32 of a
concatenation a+b can be derived from crc32(a), crc32(b) and the length of b),
you still need to run crc32 over all 1M of stored crc32 values.
And if you want another checksum such as md5/sha256, you have no choice but
to read the whole files and calculate it.
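
To make the arithmetic above concrete, here is a minimal Python sketch. It assumes the 4K sector size and 4-byte checksum mentioned above; the function name csum_bytes is only illustrative and not part of any btrfs tool.

```python
def csum_bytes(file_size: int, sector_size: int = 4096, csum_size: int = 4) -> int:
    """Bytes of per-sector checksums needed for a file of file_size bytes.

    Assumes one csum_size-byte checksum per sector_size-byte sector,
    as described in the message above.
    """
    sectors = -(-file_size // sector_size)  # ceiling division
    return sectors * csum_size


GIB = 1 << 30
print(csum_bytes(GIB))         # 1048576 bytes, i.e. 1 MiB of checksums
print(GIB // csum_bytes(GIB))  # 1024: checksums are 1/1024 of the data
```

So reading the checksums instead of the data would cut the I/O by roughly a factor of 1024, but it still grows linearly with the file size.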


But there are still some cases where btrfs can help you determine faster
whether the files are the same.
Prerequisite:
The two files were copied using clone (the cp --reflink command) or
deduplicated (see the btrfs wiki:
https://btrfs.wiki.kernel.org/index.php/Deduplication)

Method:
If cloned/deduplicated, the files share the same file extents (one extent
can be up to 128M).
So you can compare the file extents instead of comparing the whole files.
Comparing per 128M extent will definitely be faster (provided the files were
not modified after cp --reflink or deduplication).

I haven't seen such an implementation yet, so it's just a concept...
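
The concept could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the Extent tuple and both function names are hypothetical, and the extent lists are assumed to have been obtained separately (for example from the FIEMAP ioctl or from filefrag -v output). Compressed and inline extents are ignored here.

```python
from typing import List, NamedTuple, Sequence


class Extent(NamedTuple):
    """Hypothetical extent record; not a btrfs or kernel structure."""
    logical: int   # offset within the file, in bytes
    physical: int  # address of the extent as reported by the filesystem
    length: int    # extent length, in bytes


def merge_contiguous(extents: Sequence[Extent]) -> List[Extent]:
    """Merge neighbours that are contiguous both logically and physically."""
    merged: List[Extent] = []
    for e in sorted(extents):  # sorts by logical offset first
        if merged:
            last = merged[-1]
            if (last.logical + last.length == e.logical
                    and last.physical + last.length == e.physical):
                merged[-1] = Extent(last.logical, last.physical,
                                    last.length + e.length)
                continue
        merged.append(e)
    return merged


def share_all_extents(a: Sequence[Extent], b: Sequence[Extent]) -> bool:
    """True if both files map the same ranges to the same on-disk extents."""
    return merge_contiguous(a) == merge_contiguous(b)


M = 1 << 20
a = [Extent(0, 1024 * M, 128 * M), Extent(128 * M, 1152 * M, 64 * M)]
b = [Extent(0, 1024 * M, 192 * M)]
print(share_all_extents(a, b))  # True: same mapping, split differently
```

A True result means the files reference the same data on disk; a False result says nothing about the contents, so the files would still have to be read and compared.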

Thanks,
Qu
>
> (I was thinking that the btrfs-internal CRCs might be of use, here...)
>
> Regards,
>
> Lutz Vieweg



* Re: Can I get a checksum for a file from btrfs (without reading the whole file)?
  2015-02-06  5:20 ` Qu Wenruo
@ 2015-02-06 13:00   ` Lutz Vieweg
  2015-02-06 18:04     ` David Sterba
  0 siblings, 1 reply; 4+ messages in thread
From: Lutz Vieweg @ 2015-02-06 13:00 UTC (permalink / raw)
  To: linux-btrfs

On 02/06/2015 06:20 AM, Qu Wenruo wrote:
> From: Lutz Vieweg <lvml@5t9.de>
>> use case: You have two huge files on a btrfs, you assume they contain the same bytes,
>> but you do not know for sure.
>>
>> Is there a way to get a checksum of both files from btrfs with less effort than
>> reading the whole of both files and computing a hash sum?
> In short: NO.
>
> The long answer:
> In the current implementation, btrfs computes a 4-byte (32-bit) crc32 for each 4K sector and stores
> it in the csum tree.
>
> So for a large file, e.g. 1G (already quite small for modern storage), the checksums add up to 1M in size.
> That means that even using crc32 (same as the kernel, and the crc32 of a concatenation a+b can be
> derived from crc32(a), crc32(b) and the length of b), you still need to run crc32 over all 1M of
> stored crc32 values.

And yet, having to read only 1 MB of checksums instead of 1 GB of data sounds
like a good deal - is there some userspace interface that allows reading
(only) those per-4k checksums for a file?

> But there are still some cases where btrfs can help you determine faster whether the files are
> the same.
> Prerequisite:
> The two files were copied using clone (the cp --reflink command) or deduplicated

In my case I know for sure that no cloning/deduplication happened when
the files were written.

Regards,

Lutz Vieweg



* Re: Can I get a checksum for a file from btrfs (without reading the whole file)?
  2015-02-06 13:00   ` Lutz Vieweg
@ 2015-02-06 18:04     ` David Sterba
  0 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2015-02-06 18:04 UTC (permalink / raw)
  To: Lutz Vieweg; +Cc: linux-btrfs

On Fri, Feb 06, 2015 at 02:00:53PM +0100, Lutz Vieweg wrote:
> On 02/06/2015 06:20 AM, Qu Wenruo wrote:
> > From: Lutz Vieweg <lvml@5t9.de>
> >> use case: You have two huge files on a btrfs, you assume they contain the same bytes,
> >> but you do not know for sure.
> >>
> >> Is there a way to get a checksum of both files from btrfs with less effort than
> >> reading the whole of both files and computing a hash sum?
> > In short: NO.
> >
> > The long answer:
> > In the current implementation, btrfs computes a 4-byte (32-bit) crc32 for each 4K sector and stores
> > it in the csum tree.
> >
> > So for a large file, e.g. 1G (already quite small for modern storage), the checksums add up to 1M in size.
> > That means that even using crc32 (same as the kernel, and the crc32 of a concatenation a+b can be
> > derived from crc32(a), crc32(b) and the length of b), you still need to run crc32 over all 1M of
> > stored crc32 values.
> 
> And yet, having to read only 1 MB of checksums instead of 1 GB of data sounds
> like a good deal - is there some userspace interface that allows reading
> (only) those per-4k checksums for a file?

Here is some proof-of-concept code showing how to get the csum for a given
block (based on the SEARCH ioctl, needs root):

http://repo.or.cz/w/btrfs-progs-unstable/devel.git/commit/33a4d171552736da2977323797f53d9cea830e2f

crc32 is weak, but it can be used to detect early (or earlier) that the files
are different. A hash collision in the middle of huge files is possible, but
I guess the probability is very low.
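
The early-detection idea could look roughly like the Python sketch below. It is not the POC linked above: the checksums are computed in userspace with zlib.crc32 (plain CRC32, not the CRC32C that btrfs stores) purely as a stand-in, and the names block_csums and first_mismatch are illustrative. With real per-block checksums read from the csum tree, only first_mismatch would be needed.

```python
import zlib
from typing import List, Optional

BLOCK = 4096  # sector size assumed in this thread


def block_csums(data: bytes, block: int = BLOCK) -> List[int]:
    """Per-block CRC32 values; a stand-in for the stored btrfs checksums."""
    return [zlib.crc32(data[i:i + block]) for i in range(0, len(data), block)]


def first_mismatch(a: List[int], b: List[int]) -> Optional[int]:
    """Index of the first block whose checksums differ, or None if all agree."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one file has extra blocks
    return None


x = bytes(3 * BLOCK)
y = bytearray(x)
y[BLOCK + 5] = 1  # flip one byte in the second block
print(first_mismatch(block_csums(x), block_csums(bytes(y))))  # 1
print(first_mismatch(block_csums(x), block_csums(x)))         # None
```

A mismatch proves the files differ. A result of None only means that no difference was detected, so a full byte comparison is still needed for certainty.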
