From: Gabriel de Perthuis <g2p.code@gmail.com>
To: Rick van Rein <rick@vanrein.org>
Cc: linux-btrfs@vger.kernel.org, cwillu@cwillu.com,
	Mark Fasheh <mfasheh@suse.de>
Subject: Re: Manual deduplication would be useful
Date: Tue, 23 Jul 2013 19:25:59 +0200
Message-ID: <51EEBCA7.3080802@gmail.com>
In-Reply-To: <3DF45F2F-A56D-4302-AB84-31A6A3084A39@vanrein.org>
> Hello,
>
> For over a year now, I've been experimenting with stacked filesystems
> as a way to save on resources. A basic OS layer is shared among
> Containers, each of which stacks a layer with modifications on top of
> it. This approach means that Containers share buffer cache and
> loaded executables. Concrete technology choices aside, the result is
> rock-solid and the efficiency improvements are incredible, as
> documented here:
>
> http://rickywiki.vanrein.org/doku.php?id=openvz-aufs
>
> One problem with this setup is updating software. In the absence of
> stacking support in package managers, updates have to be done on a
> per-Container basis, meaning that each Container installs its own
> versions, including overwrites of the basic OS layer. Deduplication
> could remedy this, but the generic mechanism is known from ZFS to be
> fairly inefficient.
>
> Interestingly however, this particular use case demonstrates that a
> much simpler deduplication mechanism than normally considered could
> be useful. It would suffice if the filesystem could check on manual
> hints, or stack-specifying hints, to see if overlaid files share the
> same file contents; when they do, deduplication could commence. This
> saves searching through the entire filesystem for every file or block
> written. It might also mean that the actual stacking is not needed,
> but instead a basic OS could be cloned to form a new basic install,
> and kept around for this hint processing.
>
> I'm not sure if this should ideally be implemented inside the
> stacking approach (where it would be
> stacking-implementation-specific) or in the filesystem (for which it
> might be too far off the main purpose) but I thought it wouldn't hurt
> to start a discussion on it, given that (1) filesystems nowadays
> service multiple instances, (2) filesystems like Btrfs are based on
> COW, and (3) deduplication is a goal but the generic mechanism could
> use some efficiency improvements.
>
> I hope having seen this approach is useful to you!

Have a look at bedup[1] (disclaimer: I wrote it).  The normal mode
does incremental scans, and there's also a subcommand for
deduplicating files that you already know are identical:

    bedup dedup-files

The implementation in master uses a clone ioctl.  Here is Mark
Fasheh's latest patch series to implement a dedup ioctl[2]; it
also comes with a command to work on listed files
(btrfs-extent-same in [3]).

[1] https://github.com/g2p/bedup
[2] http://comments.gmane.org/gmane.comp.file-systems.btrfs/26310/
[3] https://github.com/markfasheh/duperemove
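
To make the clone-ioctl approach for hinted files concrete, here is a
rough Python sketch.  It is not bedup's code: the helper names
(same_contents, clone_over, dedup_hinted) are made up for illustration,
and the FICLONE value is the x86/ARM encoding of the Btrfs clone ioctl,
written out by hand on the assumption that the fcntl module may not
export it.  Since the clone ioctl does not compare data itself, the
sketch checks the contents in userspace first.

```python
import fcntl
import filecmp
import os
import tempfile

# Assumption: Linux on x86/ARM, where the Btrfs clone ioctl
# (BTRFS_IOC_CLONE, also known as FICLONE) is _IOW(0x94, 9, int).
FICLONE = 0x40049409


def same_contents(path_a, path_b):
    """Byte-for-byte comparison; shallow=False forces reading the data."""
    return filecmp.cmp(path_a, path_b, shallow=False)


def clone_over(src, dst):
    """Replace dst's data with a clone of src (both on one filesystem).

    Raises OSError where cloning is unsupported, e.g. on ext4 or tmpfs.
    """
    with open(src, "rb") as fsrc, open(dst, "r+b") as fdst:
        fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())


def dedup_hinted(reference, candidates):
    """Clone reference over each candidate whose contents really match.

    Returns the list of candidates that were successfully cloned.
    """
    done = []
    for path in candidates:
        if os.path.samefile(reference, path):
            continue  # same inode, nothing to do
        if not same_contents(reference, path):
            continue  # the hint was wrong; leave the file alone
        try:
            clone_over(reference, path)
        except OSError:
            continue  # filesystem cannot clone; file is left as it was
        done.append(path)
    return done
```

For the stacked-Container case this would be called with a file from
the basic OS layer as reference and the per-Container copies as
candidates.  Note that nothing stops a candidate from being modified
between the comparison and the clone; as far as I understand, closing
that window is one of the points of doing the comparison in the kernel,
as the dedup ioctl does.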