Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.

linux-btrfs.vger.kernel.org archive mirror

From: Al <6401e46d@opayq.com>
To: linux-btrfs@vger.kernel.org
Subject: Why is dedup inline, not delayed (as opposed to offline)? Explain like I'm five pls.
Date: Sat, 16 Jan 2016 12:27:16 +0000 (UTC)
Message-ID: <loom.20160116T132316-196@post.gmane.org>

Hi,

This must be a silly question! Please assume that I know not much more than
nothing about fs.
I know dedup traditionally costs a lot of memory, but I don't really
understand why it is done like that. Let me explain my question:

AFAICT dedup matches file-level chunks (or whatever you call them) using a
hash function or something similar which has limited collision potential.
The hash is used to match blocks as they are committed to disk (I'm talking
about online dedup*), and to reflink/eliminate the duplicated blocks as
necessary. This bloody great hash tree is kept in memory for speed of lookup
(I assume).
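
To make the scheme described above concrete, here is a minimal sketch of
write-time (online) dedup in Python. It is a toy model under stated
assumptions, not how btrfs or any real filesystem does it: the class
InlineDedupStore, its write_block method, and the choice of SHA-256 are all
hypothetical names and choices made for illustration.

```python
import hashlib


class InlineDedupStore:
    """Toy block store that deduplicates at write time (hypothetical)."""

    def __init__(self):
        self.blocks = {}      # block id -> data actually stored "on disk"
        self.hash_index = {}  # digest -> block id; this is the in-memory table
        self.next_id = 0

    def write_block(self, data: bytes) -> int:
        """Store data and return a block id, reusing an identical block if any."""
        digest = hashlib.sha256(data).digest()
        existing = self.hash_index.get(digest)
        # Compare contents as well, so a hash collision cannot merge
        # two different blocks.
        if existing is not None and self.blocks[existing] == data:
            return existing
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        self.hash_index[digest] = block_id
        return block_id
```

Every write pays for a hash computation and an index lookup before it can
complete, and hash_index grows with the number of unique blocks, which is
where the memory cost in this model comes from.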

But why?

Is there any urgency for dedup? What's wrong with storing the hash on disk
with the block and having a separate process dedup the written data over
time? Dedup'ing data immediately as it is written is counterproductive for
frequently rewritten data, because no sooner has it been deduped than it is
rendered obsolete by another COW write.
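
The delayed alternative I have in mind can be sketched the same way. Again
this is only a toy model under assumptions: DelayedDedupStore, write and
dedup_pass are hypothetical names, SHA-256 is an arbitrary choice, and the
sketch ignores freeing blocks of overwritten files. Writes only store the
hash next to the block; a separate pass merges duplicates later.

```python
import hashlib


class DelayedDedupStore:
    """Toy store: writes record a hash, dedup happens in a later pass."""

    def __init__(self):
        self.blocks = {}  # block id -> (digest, data); hash stored with block
        self.files = {}   # file name -> list of block ids
        self.next_id = 0

    def write(self, name, chunks):
        """Write chunks as new blocks without looking for duplicates."""
        ids = []
        for data in chunks:
            self.blocks[self.next_id] = (hashlib.sha256(data).digest(), data)
            ids.append(self.next_id)
            self.next_id += 1
        self.files[name] = ids

    def dedup_pass(self):
        """Merge blocks with equal hash and equal contents; return count."""
        canonical = {}  # digest -> id of the block that survives
        remap = {}      # duplicate block id -> surviving block id
        for block_id in sorted(self.blocks):
            digest, data = self.blocks[block_id]
            keep = canonical.setdefault(digest, block_id)
            if keep != block_id and self.blocks[keep][1] == data:
                remap[block_id] = keep
        # Point every file at the surviving blocks, then drop the duplicates.
        for name, ids in self.files.items():
            self.files[name] = [remap.get(i, i) for i in ids]
        for block_id in remap:
            del self.blocks[block_id]
        return len(remap)
```

In this model the write path does no lookup at all, and the lookup table
exists only while dedup_pass runs, at the price of duplicates occupying
space until the pass gets to them.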

There's also the risk of opening a window for problems before the commit to
disk (hopefully covered by the journal) whilst we search for the relevant
duplicate, if there is one.

Help me out, peeps? Why is there such an urgency to have online dedup,
rather than a triggered/delayed dedup, similar to the current autodefrag
process?

Thank you. I'm sure the answer is obvious, but not to me!

* dedup/dedupe/deduplication
