From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Timofey Titovets <nefelim4ag@gmail.com>, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs offline deduplication
Date: Fri, 01 Aug 2014 06:17:44 -0400
Message-ID: <53DB6948.3000009@gmail.com>
In-Reply-To: <CAGqmi747VW41DOPR-uev5cDaYtJ=FzzMCVfNNqtF7OUwB1jLjg@mail.gmail.com>

On 07/31/2014 07:54 PM, Timofey Titovets wrote:
> Good day.
> I have several questions about data deduplication on btrfs.
> Sorry if I ask stupid questions or waste your time %)
> 
> What about the implementation of offline data deduplication? I don't see
> any activity in this area; maybe I need to ask a particular person?
> Where is the problem? Maybe I can try to help (with testing, for example)?
> 
> I could be wrong, but as I understand it, btrfs stores one crc32 checksum
> per file. If this is true, would it make sense to create a small worker
> for deduplicating files, like the worker for autodefrag?
> With simple logic like:
> if sum1 == sum2 && file_size1 == file_size2; then
>     if (bit_to_bit_identical(file1, file2)); then merge(file1, file2);
> This could be a first attempt at implementing per-file offline dedup.
> What do you think about it? Could I be wrong? Or is this a horrible crutch?
> (As I understand it, it does not change the format of the fs.)
> 
> (bedup and other tools are cool, but I have several problems with these
> tools, and I think a kernel implementation could work better.)
> 
I think there may be some misunderstandings here about some of the
internals of BTRFS.  First of all, checksums are stored per block, not
per file, and secondly, deduplication can be done on a much finer scale
than individual files (you can deduplicate individual extents).
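
To make the granularity point concrete, here is a minimal Python sketch, not
btrfs code. It assumes a 4096-byte block size and uses SHA-256 as a stand-in
for whatever checksum the filesystem keeps; the names block_digests and
shared_blocks are made up for this example. It only shows that two files
which differ as a whole can still share identical blocks.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_digests(data, block_size=BLOCK_SIZE):
    """Yield (offset, digest) for every full block in data."""
    for off in range(0, len(data) - block_size + 1, block_size):
        yield off, hashlib.sha256(data[off:off + block_size]).digest()


def shared_blocks(files, block_size=BLOCK_SIZE):
    """Map digest -> [(name, offset), ...] for blocks seen more than once.

    files is a dict of name -> bytes. A matching digest is only a hint;
    a real tool would still compare the bytes before sharing anything.
    """
    seen = defaultdict(list)
    for name, data in files.items():
        for off, digest in block_digests(data, block_size):
            seen[digest].append((name, off))
    return {d: locs for d, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    files = {
        "a": b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE,
        "b": b"C" * BLOCK_SIZE + b"B" * BLOCK_SIZE,
    }
    for locs in shared_blocks(files).values():
        print(locs)
```

A whole-file checksum comparison would report nothing for "a" and "b" here,
while the per-block pass finds the second block of each as a candidate.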

I do think, however, that an option for a background thread that
deduplicates asynchronously is a good idea. You would then need some way
to trigger it on individual files/trees, and triggering on writes, as the
autodefrag thread does, doesn't make much sense.  Having a userspace
program tell it to run on a given set of files would probably be the best
approach for a trigger.  I don't remember whether this kind of thing was
included in the online deduplication patches that were posted a while back.
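
As a rough sketch of what the userspace side of such a trigger might look
like, the Python below takes an explicit set of paths, groups them by size
and content hash, and confirms candidates byte for byte before handing each
pair to a merge step. Everything here is an assumption for illustration:
find_duplicates, dedup, and the merge callback are hypothetical names, and
the callback is only a placeholder for whatever interface the kernel would
expose.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk=1 << 20):
    """Hash a file's contents in chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.digest()


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    by_key = defaultdict(list)
    for p in paths:
        # Size plus digest is only a filter; equality is confirmed below.
        by_key[(os.path.getsize(p), file_digest(p))].append(p)
    groups = []
    for candidates in by_key.values():
        while len(candidates) > 1:
            first, rest = candidates[0], candidates[1:]
            same = [first] + [p for p in rest
                              if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                groups.append(same)
            candidates = [p for p in rest if p not in same]
    return groups


def dedup(paths, merge):
    """Call merge(keep, duplicate) for each confirmed duplicate.

    merge is a hypothetical hook standing in for the real sharing step.
    """
    for group in find_duplicates(paths):
        for other in group[1:]:
            merge(group[0], other)
```

Passing something like `lambda keep, dup: print(keep, dup)` as the merge
callback gives a dry run that changes nothing on disk.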

