Re: Offline Deduplication for Btrfs

From: Peter A <loony@loonybin.org>
To: linux-btrfs@vger.kernel.org
Subject: Re: Offline Deduplication for Btrfs
Date: Thu, 6 Jan 2011 09:52:50 -0500
Message-ID: <201101060952.50500.loony@loonybin.org>
In-Reply-To: <4D25CB0F.2060002@bobich.net>

On Thursday, January 06, 2011 09:00:47 am you wrote:
> Peter A wrote:
> > I'm saying in a filesystem it doesn't matter - if you bundle everything
> > into a backup stream, it does. Think of tar. 512 byte alignment. I tar
> > up a directory with 8TB total size. No big deal. Now I create a new,
> > empty file in this dir with a name that just happens to be the first in
> > the dir. This adds 512 bytes close to the beginning of the tar file the
> > second time I run tar. Now the remainder of the tar file is all offset by
> > 512 bytes and, if you do dedupe on fs-block sized chunks larger than the
> > 512 bytes, not a single byte will be de-duped.
> 
> OK, I get what you mean now. And I don't think this is something that
> should be solved in the file system.
<snip>
> That may be a worthwhile thing to do for poorly designed backup
> solutions, but I'm not convinced about the general use-case. It'd be
> very expensive and complicated for seemingly very limited benefit.
Glad I finally explained myself properly... Unfortunately, I disagree with you
on the rest. By that logic, I could claim dedupe is nothing a file system
should handle - after all, it's the users' poorly designed applications that
store multiple copies of data. Why should the fs take care of that?

The problem doesn't just affect backups. It affects everything where you have
large data files that are not forced to align with filesystem blocks. In
addition to the case I mentioned above, it affects the following with pretty
much the same severity:
* Database dumps
* Video editing
* Files backing iSCSI volumes
* VM images (fs blocks inside the VM rarely align with fs blocks in the
backing storage). Our VM environment is backed by a 7410 and we get only
about 10% dedupe. Copying the same images to a DataDomain results in a 60%
reduction in space used.

Basically, every time I end up using a lot of storage space, it's in a scenario
where fs-block based dedupe is not very effective.
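A minimal sketch of the tar effect described in the quote above, assuming a 4 KiB filesystem block size, random bytes as a stand-in for the tar stream, and 512 zero bytes as a stand-in for the new file's tar header (all illustrative assumptions, not real tar or btrfs code). It hashes every block-aligned 4 KiB block, as a fixed-block dedupe scan would, and counts how many blocks of the second run already exist in the first. Requires Python 3.9+ for `randbytes`.

```python
import hashlib
import random

BLOCK = 4096  # assumed filesystem block size (4 KiB)


def block_hashes(data: bytes, block: int = BLOCK) -> list[bytes]:
    """SHA-256 of every block-aligned, fixed-size chunk of `data`."""
    return [hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def shared_fraction(old: bytes, new: bytes, block: int = BLOCK) -> float:
    """Fraction of `new`'s blocks whose hash already occurs in `old`."""
    seen = set(block_hashes(old, block))
    new_hashes = block_hashes(new, block)
    return sum(h in seen for h in new_hashes) / len(new_hashes)


rng = random.Random(0)
archive = rng.randbytes(4 * 1024 * 1024)  # random stand-in for a tar stream
header = bytes(512)                       # stand-in for one new 512-byte tar header
shifted = archive[:BLOCK] + header + archive[BLOCK:]

print(f"identical copy:        {shared_fraction(archive, archive):.1%} of blocks shared")
print(f"after 512-byte insert: {shared_fraction(archive, shifted):.1%} of blocks shared")
```

Only the first block, which lies entirely before the insertion point, still matches; every later block straddles a different 512-byte offset, so none of the remaining data dedupes.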

I also have to argue the point that these usages are "poorly designed". "Poorly
designed" can only apply to technologies that existed or were talked about at
the time the design was made. Tar and such have been around for a long time,
way before anyone even thought of dedupe. In addition, until there is a
commonly accepted/standard API to query the block size so apps can generate
files laid out appropriately for the backing filesystem, what is the
application supposed to do?
If anything, I would actually argue the opposite, that fixed-block dedupe is
the poor design:
* The problem was known at the time the design was made
* No alternative can be offered, as tar, netbackup, video editing, ... have been
around for a long time and are unlikely to change in the near future
* There is no standard API to query the alignment parameters (and even that
would not be great, since copying a file aligned for 8k to a 16k-aligned
filesystem would potentially cause the same issue again)
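To illustrate the last point, here is a small sketch assuming a hypothetical application that writes 8 KiB records and keeps them 8 KiB aligned, then inserts one new record at the front. The record size, record count, and random contents are illustrative assumptions. The same two files are scanned with 8 KiB and with 16 KiB fixed blocks.

```python
import hashlib
import random


def shared_fraction(old: bytes, new: bytes, block: int) -> float:
    """Fraction of `new`'s fixed-size blocks whose hash also occurs in `old`."""
    def hashes(data: bytes) -> list[bytes]:
        return [hashlib.sha256(data[i:i + block]).digest()
                for i in range(0, len(data), block)]

    seen = set(hashes(old))
    new_hashes = hashes(new)
    return sum(h in seen for h in new_hashes) / len(new_hashes)


RECORD = 8 * 1024  # hypothetical record size: the app keeps everything 8 KiB aligned
rng = random.Random(1)
old = b"".join(rng.randbytes(RECORD) for _ in range(64))
new = rng.randbytes(RECORD) + old  # one new 8 KiB record at the front

for fs_block in (8 * 1024, 16 * 1024):
    frac = shared_fraction(old, new, fs_block)
    print(f"{fs_block // 1024:2d} KiB fs blocks: {frac:.1%} of blocks shared")
```

With 8 KiB blocks, every old record still lines up with a block boundary and only the new record is unique. With 16 KiB blocks, each block pairs records differently than before (new+R0, R1+R2, ... instead of R0+R1, R2+R3, ...), so nothing matches even though the application "did the right thing" for 8k alignment.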

Also, from the human perspective it's hard to make end users understand your
point of view. I promote the 7000 series of storage and I know how hard it is
to explain the dedupe behavior there. They see that DataDomain does it, and
does it well. So why can't solution xyz do it just as well?

> Typical. And no doubt they complain that ZFS isn't doing what they want,
> rather than netbackup not co-operating. The solution to one misdesign
> isn't an expensive bodge. The solution to this particular problem is to
> make netbackup work on a per-file rather than a per-stream basis.
I'd agree if it were just limited to netbackup... I know variable-length dedupe
is a significantly harder problem than fixed block-level dedupe. That's why the
ZFS team made the design choice they did. Variable length is also the reason
why the DataDomain solution takes a scale-out rather than a scale-up approach.
However, CPUs keep getting faster - eventually they'll be able to handle it.
So the right solution (from my limited point of view; as I said, I'm not a
filesystem design expert) would be to implement the data structures to handle
variable-length chunks. Then, in the first iteration, implement the dedupe
algorithm to only search on filesystem blocks using existing checksums and
such. Less CPU usage, quicker development, easier debugging. Once that is
stable and proven, you can go ahead and implement variable-length dedupe
without requiring the user to reformat...
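Variable-length dedupe is commonly implemented as content-defined chunking: chunk boundaries are picked from the data itself with a rolling hash, so an insertion only disturbs the chunks around it instead of shifting every later block. The sketch below uses a Gear-style rolling hash (the kind used by FastCDC-like chunkers). The mask, minimum/maximum chunk sizes, table seed, and function names are illustrative assumptions; this is not btrfs, ZFS, or DataDomain code. Note that it has to touch every byte, which is where the extra CPU cost comes from. Requires Python 3.9+.

```python
import hashlib
import random

# Illustrative parameters (assumptions, not taken from any product).
AVG_BITS = 13                                    # roughly 8 KiB average chunk size
MASK = ((1 << AVG_BITS) - 1) << (64 - AVG_BITS)  # test the top 13 bits of the hash
MIN_CHUNK = 2 * 1024
MAX_CHUNK = 64 * 1024
U64 = (1 << 64) - 1

_table_rng = random.Random(42)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]  # random value per byte


def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut `data` wherever a Gear rolling hash of the recent bytes matches MASK.

    Each step shifts the 64-bit hash left by one, so only the last 64 bytes
    influence it: cut points depend on local content, not on file offsets.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & U64
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_bytes_fraction(old: bytes, new: bytes) -> float:
    """Fraction of `new`'s bytes that sit in chunks already present in `old`."""
    seen = {hashlib.sha256(c).digest() for c in cdc_chunks(old)}
    reused = sum(len(c) for c in cdc_chunks(new)
                 if hashlib.sha256(c).digest() in seen)
    return reused / len(new)


rng = random.Random(0)
archive = rng.randbytes(1024 * 1024)                    # stand-in for a tar stream
shifted = archive[:4096] + bytes(512) + archive[4096:]  # 512-byte insert near the start
print(f"{shared_bytes_fraction(archive, shifted):.1%} of the shifted stream dedupes")
```

Only the one or two chunks around the insertion point change; once the rolling hash has seen 64 bytes of unchanged data again, the cut points fall back into step with the original stream and the remaining chunks dedupe.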

Btw, thanks for your time, Gordan :)

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
