linux-btrfs.vger.kernel.org archive mirror
* Manual deduplication would be useful
@ 2013-07-23 15:47 Rick van Rein
  2013-07-23 16:06 ` cwillu
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Rick van Rein @ 2013-07-23 15:47 UTC (permalink / raw)
  To: linux-btrfs

Hello,

For over a year now, I've been experimenting with stacked filesystems as a way to save on resources.  A basic OS layer is shared among Containers, each of which stacks a layer with modifications on top of it.  This approach means that Containers share buffer cache and loaded executables.  Concrete technology choices aside, the result is rock-solid and the efficiency improvements are incredible, as documented here:

http://rickywiki.vanrein.org/doku.php?id=openvz-aufs

One problem with this setup is updating software.  In the absence of stacking support in package managers, updates have to be done on a per-Container basis, meaning that each Container installs its own versions, including overwrites of files in the basic OS layer.  Deduplication could remedy this, but the generic mechanism is known from ZFS to be fairly inefficient.

Interestingly, however, this particular use case demonstrates that a much simpler deduplication mechanism than normally considered could be useful.  It would suffice if the filesystem could act on manual hints, or stack-specifying hints, to check whether overlaid files share the same contents; when they do, deduplication could commence.  This avoids searching through the entire filesystem for every file or block written.  It might also mean that the actual stacking is not needed; instead, a basic OS could be cloned to form a new basic install, and kept around for this hint processing.
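
As a rough illustration of the hint-driven check described above, here is a minimal userspace sketch in Python.  It is not taken from any existing tool: the names file_digest and dedup_candidates, and the base_root / overlay_root arguments, are hypothetical.  It only identifies candidate pairs (files at the same relative path with identical contents); actually sharing the storage would still need filesystem support.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedup_candidates(base_root, overlay_root):
    """Yield (base_path, overlay_path) pairs with identical contents.

    Only files that exist at the same relative path in both trees are
    compared, so the work scales with the overlay rather than with the
    whole filesystem.  Symlinks are skipped.
    """
    for dirpath, _dirnames, filenames in os.walk(overlay_root):
        rel_dir = os.path.relpath(dirpath, overlay_root)
        for name in filenames:
            overlay_path = os.path.join(dirpath, name)
            base_path = os.path.normpath(
                os.path.join(base_root, rel_dir, name))
            if os.path.islink(overlay_path) or os.path.islink(base_path):
                continue
            if not os.path.isfile(base_path):
                continue  # file exists only in the overlay
            # Cheap size check first; hash only when the sizes match.
            if os.path.getsize(base_path) != os.path.getsize(overlay_path):
                continue
            if file_digest(base_path) == file_digest(overlay_path):
                yield base_path, overlay_path
```

The relative path acts as the "hint": nothing outside the two named trees is ever read.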

I'm not sure if this should ideally be implemented inside the stacking approach (where it would be stacking-implementation-specific) or in the filesystem (for which it might be too far off the main purpose) but I thought it wouldn't hurt to start a discussion on it, given that (1) filesystems nowadays service multiple instances, (2) filesystems like Btrfs are based on COW, and (3) deduplication is a goal but the generic mechanism could use some efficiency improvements.

I hope having seen this approach is useful to you!

Please reply-all?  I'm not on this list.

Cheers,
 -Rick


* Re: Manual deduplication would be useful
  2013-07-23 15:47 Manual deduplication would be useful Rick van Rein
@ 2013-07-23 16:06 ` cwillu
  2013-07-23 17:25 ` Gabriel de Perthuis
  2013-07-23 21:40 ` Rick van Rein
  2 siblings, 0 replies; 4+ messages in thread
From: cwillu @ 2013-07-23 16:06 UTC (permalink / raw)
  To: Rick van Rein; +Cc: linux-btrfs

On Tue, Jul 23, 2013 at 9:47 AM, Rick van Rein <rick@vanrein.org> wrote:
> Hello,
>
> For over a year now, I've been experimenting with stacked filesystems as a way to save on resources.  A basic OS layer is shared among Containers, each of which stacks a layer with modifications on top of it.  This approach means that Containers share buffer cache and loaded executables.  Concrete technology choices aside, the result is rock-solid and the efficiency improvements are incredible, as documented here:
>
> http://rickywiki.vanrein.org/doku.php?id=openvz-aufs
>
> One problem with this setup is updating software.  In lieu of stacking-support in package managers, it is necessary to do this on a per-Container basis, meaning that each installs their own versions, including overwrites of the basic OS layer.  Deduplication could remedy this, but the generic mechanism is known from ZFS to be fairly inefficient.
>
> Interestingly however, this particular use case demonstrates that a much simpler deduplication mechanism than normally considered could be useful.  It would suffice if the filesystem could check on manual hints, or stack-specifying hints, to see if overlaid files share the same file contents; when they do, deduplication could commence.  This saves searching through the entire filesystem for every file or block written.  It might also mean that the actual stacking is not needed, but instead a basic OS could be cloned to form a new basic install, and kept around for this hint processing.
>
> I'm not sure if this should ideally be implemented inside the stacking approach (where it would be stacking-implementation-specific) or in the filesystem (for which it might be too far off the main purpose) but I thought it wouldn't hurt to start a discussion on it, given that (1) filesystems nowadays service multiple instances, (2) filesystems like Btrfs are based on COW, and (3) deduplication is a goal but the generic mechanism could use some efficiency improvements.
>
> I hope having seen this approach is useful to you!
>
> Please reply-all?  I'm not on this list.
>
> Cheers,
>  -Rick

There are patches providing offline dedup (i.e., manually telling the
kernel which files to consider) floating around:
http://lwn.net/Articles/547542/


* Re: Manual deduplication would be useful
  2013-07-23 15:47 Manual deduplication would be useful Rick van Rein
  2013-07-23 16:06 ` cwillu
@ 2013-07-23 17:25 ` Gabriel de Perthuis
  2013-07-23 21:40 ` Rick van Rein
  2 siblings, 0 replies; 4+ messages in thread
From: Gabriel de Perthuis @ 2013-07-23 17:25 UTC (permalink / raw)
  To: Rick van Rein; +Cc: linux-btrfs, cwillu, Mark Fasheh

> Hello,
> 
> For over a year now, I've been experimenting with stacked filesystems
> as a way to save on resources.  A basic OS layer is shared among
> Containers, each of which stacks a layer with modifications on top of
> it.  This approach means that Containers share buffer cache and
> loaded executables.  Concrete technology choices aside, the result is
> rock-solid and the efficiency improvements are incredible, as
> documented here:
> 
> http://rickywiki.vanrein.org/doku.php?id=openvz-aufs
> 
> One problem with this setup is updating software.  In lieu of
> stacking-support in package managers, it is necessary to do this on a
> per-Container basis, meaning that each installs their own versions,
> including overwrites of the basic OS layer.  Deduplication could
> remedy this, but the generic mechanism is known from ZFS to be fairly
> inefficient.
> 
> Interestingly however, this particular use case demonstrates that a
> much simpler deduplication mechanism than normally considered could
> be useful.  It would suffice if the filesystem could check on manual
> hints, or stack-specifying hints, to see if overlaid files share the
> same file contents; when they do, deduplication could commence.  This
> saves searching through the entire filesystem for every file or block
> written.  It might also mean that the actual stacking is not needed,
> but instead a basic OS could be cloned to form a new basic install,
> and kept around for this hint processing.
> 
> I'm not sure if this should ideally be implemented inside the
> stacking approach (where it would be
> stacking-implementation-specific) or in the filesystem (for which it
> might be too far off the main purpose) but I thought it wouldn't hurt
> to start a discussion on it, given that (1) filesystems nowadays
> service multiple instances, (2) filesystems like Btrfs are based on
> COW, and (3) deduplication is a goal but the generic mechanism could
> use some efficiency improvements.
> 
> I hope having seen this approach is useful to you!

Have a look at bedup[1] (disclaimer: I wrote it).  The normal mode
does incremental scans, and there's also a subcommand for
deduplicating files that you already know are identical:
  bedup dedup-files

The implementation in master uses a clone ioctl.  Here is Mark
Fasheh's latest patch series to implement a dedup ioctl[2]; it
also comes with a command to work on listed files
(btrfs-extent-same in [3]).

[1] https://github.com/g2p/bedup
[2] http://comments.gmane.org/gmane.comp.file-systems.btrfs/26310/
[3] https://github.com/markfasheh/duperemove
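
For readers unfamiliar with the clone ioctl mentioned above, the following is a minimal Python sketch of how a userspace tool might call it.  It is not bedup's code.  The helper names _iow and clone_file are hypothetical, and the request number assumes the common Linux ioctl encoding (as on x86 and ARM), where BTRFS_IOC_CLONE is _IOW(0x94, 9, int).  Note that the clone ioctl does not compare contents, so the caller must already know the files are identical, and this sketch does not preserve the destination's ownership or permissions.

```python
import fcntl
import os
import tempfile

_IOC_WRITE = 1  # direction bit value on the common Linux ioctl encoding


def _iow(type_, nr, size):
    """Build a write-direction ioctl request number (like the C _IOW macro)."""
    return (_IOC_WRITE << 30) | (size << 16) | (type_ << 8) | nr


# _IOW(0x94, 9, int), with a 4-byte int
BTRFS_IOC_CLONE = _iow(0x94, 9, 4)


def clone_file(src_path, dst_path):
    """Replace dst_path with a file that shares src_path's extents.

    Returns True on success and False if cloning is not possible (for
    example, on a filesystem without clone support).  The clone goes
    into a temporary file first, so dst_path is left untouched on
    failure.
    """
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir)
    try:
        with open(src_path, "rb") as src:
            # The destination fd receives the ioctl; the source fd is the argument.
            fcntl.ioctl(fd, BTRFS_IOC_CLONE, src.fileno())
    except OSError:
        os.unlink(tmp_path)
        return False
    finally:
        os.close(fd)
    os.replace(tmp_path, dst_path)
    return True
```

Because nothing stops another process from modifying either file between the comparison and the clone, this approach is only safe on files known to be quiescent.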


* Re: Manual deduplication would be useful
  2013-07-23 15:47 Manual deduplication would be useful Rick van Rein
  2013-07-23 16:06 ` cwillu
  2013-07-23 17:25 ` Gabriel de Perthuis
@ 2013-07-23 21:40 ` Rick van Rein
  2 siblings, 0 replies; 4+ messages in thread
From: Rick van Rein @ 2013-07-23 21:40 UTC (permalink / raw)
  To: linux-btrfs

Hi cwillu and Gabriel,

I wasn't aware that work was already being done.  I actually imagined having to defend what I brought up :-)

What you sent looks interesting and useful, especially the support in userspace.  I will investigate these tools!

Till then -- thanks!
 -Rick

