Linux Btrfs filesystem development
From: Lutz Vieweg <lvml@5t9.de>
To: linux-btrfs@vger.kernel.org
Subject: Re: Can I get a checksum for a file from btrfs (without reading the whole file)?
Date: Fri, 06 Feb 2015 14:00:53 +0100
Message-ID: <mb2du5$esf$1@ger.gmane.org>
In-Reply-To: <54D44F2A.8090108@cn.fujitsu.com>

On 02/06/2015 06:20 AM, Qu Wenruo wrote:
> From: Lutz Vieweg <lvml@5t9.de>
>> use case: You have two huge files on a btrfs, you assume they contain the same bytes,
>> but you do not know for sure.
>>
>> Is there a way to get a checksum of both files from btrfs with less effort than
>> reading the whole of both files and computing a hash sum?
> In short, NO.
>
> In more detail:
> In the current implementation, btrfs computes a 4-byte (32-bit) crc32 for each 4K sector and stores it
> in the csum tree.
>
> So, for a large file, e.g. 1G (already quite small for modern storage), its checksums will be 1M in size.
> This means that even when using crc32 (the same as the kernel, and crc32(a+b) = crc32(a) + crc32(b)), you
> still need to run crc32 over all 1M of crc32 values.

And yet, having to read only 1 MB of checksums instead of 1 GB of data
sounds like a good deal - is there some userspace interface that allows
reading (only) those per-4k checksums for a file?
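
To make the arithmetic above concrete, here is a minimal sketch of the
size calculation and of a per-4k comparison. It does not read anything
from btrfs: the function names are hypothetical, and zlib's crc32 is
only a stand-in that is not claimed to match the values btrfs stores
in its csum tree.

```python
import zlib

SECTOR_SIZE = 4096  # 4K sector, as in the quoted explanation
CSUM_SIZE = 4       # 4-byte (32-bit) checksum per sector


def csum_bytes(file_size):
    """Total size of the per-sector checksums for a file of file_size bytes."""
    sectors = -(-file_size // SECTOR_SIZE)  # ceiling division
    return sectors * CSUM_SIZE


def sector_csums(data):
    """Per-4K checksums of a byte string (zlib crc32 as a stand-in only)."""
    return [zlib.crc32(data[i:i + SECTOR_SIZE])
            for i in range(0, len(data), SECTOR_SIZE)]


def first_mismatch(csums_a, csums_b):
    """Index of the first differing sector, or None if the lists are equal."""
    for i, (a, b) in enumerate(zip(csums_a, csums_b)):
        if a != b:
            return i
    if len(csums_a) != len(csums_b):
        # One list is a strict prefix of the other.
        return min(len(csums_a), len(csums_b))
    return None


if __name__ == "__main__":
    print(csum_bytes(1 << 30))  # checksum bytes for a 1 GiB file
```

Differing checksums prove that the data differs; equal 32-bit checksums
only make equality likely, since collisions are possible.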

> But there are still some cases where btrfs can help you determine faster whether the files are
> the same.
> Prerequisite:
> The two files were copied using clone (the cp --reflink command) or deduplicated.

In my case I know for sure that no cloning/deduplication happened when
the files were written.
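
For cases where that prerequisite does hold, the idea can be sketched
as a comparison of extent mappings. The code below assumes the extent
lists have already been obtained elsewhere (for example from what
filefrag -v or the FIEMAP ioctl reports); the Extent type and function
names are hypothetical, and special cases such as inline or compressed
extents are ignored.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    logical: int   # byte offset within the file
    physical: int  # byte offset on the filesystem
    length: int    # extent length in bytes


def merge(extents: List[Extent]) -> List[Extent]:
    """Join extents that are contiguous both logically and physically."""
    merged: List[Extent] = []
    for e in sorted(extents):
        if merged:
            last = merged[-1]
            if (last.logical + last.length == e.logical
                    and last.physical + last.length == e.physical):
                merged[-1] = Extent(last.logical, last.physical,
                                    last.length + e.length)
                continue
        merged.append(e)
    return merged


def same_physical_layout(a: List[Extent], b: List[Extent]) -> bool:
    """True if both files map the same logical ranges to the same blocks."""
    return merge(a) == merge(b)
```

If both files map to exactly the same physical extents, they share the
same on-disk data. A different mapping says nothing either way, which
is why this does not help for files that were written independently.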

Regards,

Lutz Vieweg


Thread overview: 4+ messages
2015-02-05 10:40 Can I get a checksum for a file from btrfs (without reading the whole file)? Lutz Vieweg
2015-02-06  5:20 ` Qu Wenruo
2015-02-06 13:00   ` Lutz Vieweg [this message]
2015-02-06 18:04     ` David Sterba
