From: Gabriel de Perthuis <g2p.code@gmail.com>
To: Rick van Rein <rick@vanrein.org>
Cc: linux-btrfs@vger.kernel.org, cwillu@cwillu.com,
Mark Fasheh <mfasheh@suse.de>
Subject: Re: Manual deduplication would be useful
Date: Tue, 23 Jul 2013 19:25:59 +0200
Message-ID: <51EEBCA7.3080802@gmail.com>
In-Reply-To: <3DF45F2F-A56D-4302-AB84-31A6A3084A39@vanrein.org>
> Hello,
>
> For over a year now, I've been experimenting with stacked filesystems
> as a way to save on resources. A basic OS layer is shared among
> Containers, each of which stacks a layer with modifications on top of
> it. This approach means that Containers share buffer cache and
> loaded executables. Concrete technology choices aside, the result is
> rock-solid and the efficiency improvements are incredible, as
> documented here:
>
> http://rickywiki.vanrein.org/doku.php?id=openvz-aufs
>
> One problem with this setup is updating software. In lieu of
> stacking-support in package managers, it is necessary to do this on a
> per-Container basis, meaning that each installs their own versions,
> including overwrites of the basic OS layer. Deduplication could
> remedy this, but the generic mechanism is known from ZFS to be fairly
> inefficient.
>
> Interestingly however, this particular use case demonstrates that a
> much simpler deduplication mechanism than normally considered could
> be useful. It would suffice if the filesystem could check on manual
> hints, or stack-specifying hints, to see if overlaid files share the
> same file contents; when they do, deduplication could commence. This
> saves searching through the entire filesystem for every file or block
> written. It might also mean that the actual stacking is not needed,
> but instead a basic OS could be cloned to form a new basic install,
> and kept around for this hint processing.
>
> I'm not sure if this should ideally be implemented inside the
> stacking approach (where it would be
> stacking-implementation-specific) or in the filesystem (for which it
> might be too far off the main purpose) but I thought it wouldn't hurt
> to start a discussion on it, given that (1) filesystems nowadays
> service multiple instances, (2) filesystems like Btrfs are based on
> COW, and (3) deduplication is a goal but the generic mechanism could
> use some efficiency improvements.
>
> I hope having seen this approach is useful to you!

Have a look at bedup[1] (disclaimer: I wrote it). The normal mode
does incremental scans, and there's also a subcommand for
deduplicating files that you already know are identical:

    bedup dedup-files

The implementation in master uses a clone ioctl. Mark Fasheh's
latest patch series[2] implements a dedicated dedup ioctl; it
also comes with a command that works on listed files
(btrfs-extent-same in [3]).

[1] https://github.com/g2p/bedup
[2] http://comments.gmane.org/gmane.comp.file-systems.btrfs/26310/
[3] https://github.com/markfasheh/duperemove
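
The hint-driven approach discussed above (compare only the files you already suspect are identical, then clone) can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not bedup's actual implementation: the helper names same_contents, clone_over and dedup_hinted are hypothetical, the ioctl number is the Linux FICLONE value (the generic form of the Btrfs clone ioctl), and the sketch does nothing to guard against a file changing between the comparison and the clone, which is a gap a dedicated dedup ioctl is meant to close.

```python
import fcntl
import filecmp
import os

# Linux FICLONE, originally BTRFS_IOC_CLONE: _IOW(0x94, 9, int).
FICLONE = 0x40049409


def same_contents(path_a, path_b):
    """Return True if the two files hold identical bytes."""
    # Cheap size check first, then a full byte-by-byte comparison.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return filecmp.cmp(path_a, path_b, shallow=False)


def clone_over(src, dst):
    """Make dst share src's extents (hypothetical helper).

    Needs a filesystem with clone support such as Btrfs; raises OSError
    elsewhere. Not safe against concurrent writers.
    """
    with open(src, "rb") as s, open(dst, "r+b") as d:
        fcntl.ioctl(d.fileno(), FICLONE, s.fileno())


def dedup_hinted(pairs, clone=clone_over):
    """Clone each hinted (src, dst) pair whose contents match.

    Returns the list of pairs that were actually cloned.
    """
    done = []
    for src, dst in pairs:
        if same_contents(src, dst):
            clone(src, dst)
            done.append((src, dst))
    return done
```

A caller would pass pairs such as (file in the base OS layer, same path in a container layer); pairs that differ are simply skipped, so a stale hint costs one comparison and nothing more.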