Re: Online Deduplication for Btrfs (Master's thesis)

From: Chris Mason <chris.mason@fusionio.com>
To: Alexander Block <ablock84@gmail.com>
Cc: "Martin Křížek" <martin.krizek@gmail.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"lczerner@redhat.com" <lczerner@redhat.com>
Subject: Re: Online Deduplication for Btrfs (Master's thesis)
Date: Mon, 17 Dec 2012 20:31:11 -0500
Message-ID: <20121218013111.GB22912@shiny>
In-Reply-To: <CAB9VWqCjoFGdZVx0Uup2zmvuqS2PHiQO=h2EKawH-dZiGqvfDQ@mail.gmail.com>

On Mon, Dec 17, 2012 at 06:33:24AM -0700, Alexander Block wrote:
> I did some research on deduplication in the past and there are some
> problems that you will face. I'll try to list some of them (for sure
> not all).

Thanks Alexander for writing all of this up.  There are a lot of great
points here, but I'll summarize with:

[ many challenges to online dedup ]

[ offline dedup is the best way ]

So, the big problem with offline dedup is that you're suddenly read-bound.
I don't disagree that offline makes a lot of the dedup problems easier,
and Alexander describes a very interesting system here.

I've tried to avoid features that rely on scanning though, just because
idle disk time may not really exist.  But with scrub, we have the scan
as a feature, and it may make a lot of sense to leverage that.
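To make the scan-based idea concrete, here is a minimal userspace sketch of
what a dedup pass piggybacking on a full read of the data might do: hash
every block and group offsets by digest. It is an illustration only, not
btrfs scrub code; the 4 KiB block size, the choice of SHA-256, and the
function name find_duplicate_blocks are all assumptions.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def find_duplicate_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Map each digest seen more than once to the block offsets holding it."""
    offsets_by_digest = defaultdict(list)
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        offsets_by_digest[hashlib.sha256(block).digest()].append(offset)
    # Only digests with two or more offsets are dedup candidates.
    return {d: offs for d, offs in offsets_by_digest.items() if len(offs) > 1}
```

Every block has to be read once to be hashed, which is where the read-bound
cost of the offline approach comes from.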

Online dedup has a different set of tradeoffs, but as Alexander says, the
hard part really is the data structure to index the hashes.  I think
there are a few different options here, including changing the file
extent pointers to point to a sha instead of a logical disk offset.
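As a rough sketch of the "extent points to a sha" option, assuming a plain
in-memory dictionary as the hash index: each file keeps a list of digests
rather than disk offsets, and the store holds one refcounted copy per unique
block. The class names ContentAddressedStore and HashedFile are hypothetical,
and a real on-disk index would need a far more careful structure than a
Python dict.

```python
import hashlib


class ContentAddressedStore:
    """Hypothetical hash index: digest -> [block data, reference count]."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> bytes:
        digest = hashlib.sha256(data).digest()
        entry = self.blocks.get(digest)
        if entry is None:
            self.blocks[digest] = [data, 1]  # first copy: store the data
        else:
            entry[1] += 1  # duplicate: only bump the refcount
        return digest

    def get(self, digest: bytes) -> bytes:
        return self.blocks[digest][0]


class HashedFile:
    """A file whose extent pointers are digests instead of disk offsets."""

    def __init__(self, store: ContentAddressedStore):
        self.store = store
        self.extents = []  # ordered list of digests

    def append(self, data: bytes) -> None:
        self.extents.append(self.store.put(data))

    def read(self) -> bytes:
        return b"".join(self.store.get(d) for d in self.extents)
```

With this layout the dedup happens at write time as a side effect of the
lookup, so the cost moves from scanning to keeping the index fast.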

So, part of my answer really depends on where you want to go with your
thesis.  I expect the data structure work for efficient hash lookup is
going to be closer to what your coursework requires?

-chris

