* btrfs dedup - available or experimental? Or yet to be?
From: Martin @ 2015-03-23 23:10 UTC
To: linux-btrfs

As titled:

Does btrfs have dedup (on raid1, multiple disks) that can be enabled?

Can anyone relate any experiences?

Is there (or will there be) a bad fragmentation penalty?

(For kernel 3.18.9)

Thanks,

Martin
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Hugo Mills @ 2015-03-23 23:22 UTC
To: Martin; +Cc: linux-btrfs

On Mon, Mar 23, 2015 at 11:10:46PM +0000, Martin wrote:
> As titled:
>
> Does btrfs have dedup (on raid1 multiple disks) that can be enabled?

The current state of play is on the wiki:

https://btrfs.wiki.kernel.org/index.php/Deduplication

> Can anyone relate any experiences?

duperemove is reported as working.

> Is there (or will there be) a bad fragmentation penalty?

duperemove operates at the scale of extents, not individual blocks,
so the fragmentation isn't so bad.

   Hugo.

--
Hugo Mills             | ©1973 Unclear Research Ltd
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: 65E74AC0
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Rich Freeman @ 2015-03-25  1:30 UTC
To: Hugo Mills, Martin, Btrfs BTRFS

On Mon, Mar 23, 2015 at 7:22 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Mon, Mar 23, 2015 at 11:10:46PM +0000, Martin wrote:
>> As titled:
>>
>> Does btrfs have dedup (on raid1 multiple disks) that can be enabled?
>
> The current state of play is on the wiki:
>
> https://btrfs.wiki.kernel.org/index.php/Deduplication

I hadn't realized that bedup was deprecated.

This seems unfortunate, since bedup seemed to be a lot smarter about
detecting what has and hasn't already been scanned, and it also
supported defragmenting files while de-duplicating them.

I'll give duperemove a shot. I just packaged it on Gentoo.

--
Rich
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Martin @ 2015-03-27  0:07 UTC
To: linux-btrfs

On 25/03/15 01:30, Rich Freeman wrote:
> On Mon, Mar 23, 2015 at 7:22 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> On Mon, Mar 23, 2015 at 11:10:46PM +0000, Martin wrote:
>>> Does btrfs have dedup (on raid1 multiple disks) that can be enabled?
>>
>> The current state of play is on the wiki:
>>
>> https://btrfs.wiki.kernel.org/index.php/Deduplication
>
> I hadn't realized that bedup was deprecated.
>
> This seems unfortunate, since bedup seemed to be a lot smarter about
> detecting what has and hasn't already been scanned, and it also
> supported defragmenting files while de-duplicating them.
>
> I'll give duperemove a shot. I just packaged it on Gentoo.

Excellent and very rapid packaging, thanks!

Already compiled, installed, and soon to be tried on a test subvolume...

Anyone with any comments on how well duperemove performs for TB-sized
volumes?

Does it work across subvolumes? (Presumably not...)

Thanks,

Martin
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Rich Freeman @ 2015-03-27  0:30 UTC
To: Martin; +Cc: Btrfs BTRFS

On Thu, Mar 26, 2015 at 8:07 PM, Martin <m_btrfs@ml1.co.uk> wrote:
>
> Anyone with any comments on how well duperemove performs for TB-sized
> volumes?

It took many hours, but less than a day, for a few TB. I'm not sure
whether it is smart enough to take less time on subsequent scans, like
bedup does.

> Does it work across subvolumes? (Presumably not...)

As far as I can tell, yes. Unless you pass a command-line option, it
crosses filesystem boundaries and even scans non-btrfs filesystems
(like /proc, /dev, etc.). Obviously you'll want to avoid that, since
it only wastes time, and I can just imagine it trying to hash kcore
and such.

Other than being less than ideal intelligence-wise, it seemed
effective. I can live with that in an early release like this.

--
Rich
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Kai Krakow @ 2015-03-29 11:43 UTC
To: linux-btrfs

Rich Freeman <r-btrfs@thefreemanclan.net> wrote:

> On Thu, Mar 26, 2015 at 8:07 PM, Martin <m_btrfs@ml1.co.uk> wrote:
>>
>> Anyone with any comments on how well duperemove performs for TB-sized
>> volumes?
>
> It took many hours, but less than a day, for a few TB. I'm not sure
> whether it is smart enough to take less time on subsequent scans, like
> bedup does.
>
>> Does it work across subvolumes? (Presumably not...)
>
> As far as I can tell, yes. Unless you pass a command-line option, it
> crosses filesystem boundaries and even scans non-btrfs filesystems
> (like /proc, /dev, etc.). Obviously you'll want to avoid that, since
> it only wastes time, and I can just imagine it trying to hash kcore
> and such.
>
> Other than being less than ideal intelligence-wise, it seemed
> effective. I can live with that in an early release like this.

This behaviour is mainly there to support deduping across different
subvolumes within the same device pool. So I think the idea is neither
less than ideal nor unintelligent, and it has nothing to do with
performance.

But your warning is still valid: one should take care not to "dedupe"
special filesystems (though that applies to every other tool that
supports recursion, like rsync or cp), nor is it very effective for
the deduplication process to cross a boundary onto a non-btrfs device.
There are one or two exceptions: you may want duperemove to write
hashes for a non-btrfs device and use the result for other purposes
outside of duperemove's scope, or you may be nesting btrfs inside
non-btrfs inside btrfs mounts, or...

Concluding from that: duperemove should probably not try to become
smart about filesystem boundaries. It should either cross them or not,
as it does now - the choice is left to the user (as is the task of
supplying the proper cmdline arguments).

With the planned performance improvements, I'm guessing the best
approach will become mounting the root subvolume (subvolid 0) and
letting duperemove work on that as a whole - including crossing all
fs boundaries.

--
Replies to list only preferred.
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Rich Freeman @ 2015-03-29 12:31 UTC
To: Kai Krakow; +Cc: Btrfs BTRFS

On Sun, Mar 29, 2015 at 7:43 AM, Kai Krakow <hurikhan77@gmail.com> wrote:
>
> With the planned performance improvements, I'm guessing the best
> approach will become mounting the root subvolume (subvolid 0) and
> letting duperemove work on that as a whole - including crossing all
> fs boundaries.

Why cross filesystem boundaries by default? If you scan from the root
subvolume, you're guaranteed to traverse every file on the filesystem
(which is all that can be deduped) without crossing any filesystem
boundaries. Even if you have btrfs on non-btrfs on btrfs, there must
be some other path that reaches the same files when scanning from
subvolid 0.

--
Rich
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Kai Krakow @ 2015-03-29 14:44 UTC
To: linux-btrfs

Rich Freeman <r-btrfs@thefreemanclan.net> wrote:

> On Sun, Mar 29, 2015 at 7:43 AM, Kai Krakow <hurikhan77@gmail.com> wrote:
>>
>> With the planned performance improvements, I'm guessing the best
>> approach will become mounting the root subvolume (subvolid 0) and
>> letting duperemove work on that as a whole - including crossing all
>> fs boundaries.
>
> Why cross filesystem boundaries by default? If you scan from the root
> subvolume, you're guaranteed to traverse every file on the filesystem
> (which is all that can be deduped) without crossing any filesystem
> boundaries. Even if you have btrfs on non-btrfs on btrfs, there must
> be some other path that reaches the same files when scanning from
> subvolid 0.

Yes, the chosen "default" is probably not the best for this kind of
utility, but I suppose it follows the principle of least surprise. At
least every utility I use daily (like find) behaves this way.

By the way, I put "default" in quotes because one should keep in mind
that duperemove is not recursive by default (so crossing a boundary
wouldn't even come up in the default configuration), which only
strengthens my point about the principle of least surprise.

I'd leave it open for discussion here whether the default should
change; all I suggested was that duperemove should not try to become
smart about boundaries as its only behaviour (the results would
otherwise be unpredictable across a vast number of individually
configured systems). I could imagine a cmdline option to make it
smart.

The idea of using subvolid 0 is simply how I would use it for my own
purposes. By no means should it be the default in any deployment.

--
Replies to list only preferred.
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Christoph Anton Mitterer @ 2015-03-29 17:54 UTC
To: linux-btrfs

On Sun, 2015-03-29 at 16:44 +0200, Kai Krakow wrote:
> Yes, the chosen "default" is probably not the best for this kind of
> utility, but I suppose it follows the principle of least surprise. At
> least every utility I use daily (like find) behaves this way.

But the default with all these tools is that they operate on the file
hierarchy and don't care about filesystems at all - or at least not in
their original meaning.

dedup, however, is IMHO a more filesystem-internal operation... more
like defragmentation or tune2fs.

Cheers.
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Christoph Anton Mitterer @ 2015-03-29 17:51 UTC
To: linux-btrfs

On Sun, 2015-03-29 at 13:43 +0200, Kai Krakow wrote:
> Concluding from that: duperemove should probably not try to become
> smart about filesystem boundaries. It should either cross them or not,
> as it does now - the choice is left to the user (as is the task of
> supplying the proper cmdline arguments).

Couldn't it, by default, simply cross boundaries only within the same
btrfs filesystem (i.e. amongst all of its subvolumes), since that
seems to be the natural choice users want in most cases... and allow
crossing onto other filesystems via a --no-xdev option or something
like that?

Cheers,
Chris.
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Mark Fasheh @ 2015-03-27 20:51 UTC
To: Martin; +Cc: linux-btrfs

On Fri, Mar 27, 2015 at 12:07:29AM +0000, Martin wrote:
> Excellent and very rapid packaging, thanks!
>
> Already compiled, installed, and soon to be tried on a test subvolume...
>
> Anyone with any comments on how well duperemove performs for TB-sized
> volumes?

https://github.com/markfasheh/duperemove/wiki/Performance-Numbers

That page has some sample performance numbers. Keep in mind that the
tests were done on reasonably nice hardware.

TB-sized is definitely on the larger end of what I expect it to be
handling these days. The biggest problem you would see is memory
usage - versions 0.09 and below store all hashes in memory, so if
everything else is fast enough, that's likely the first bump you'll
hit.

The master branch has some code which reduces memory consumption
dramatically by using a Bloom filter and temporarily storing hashes on
disk. That branch needs some more features and bug fixing before I'm
ready to call it stable.

> Does it work across subvolumes? (Presumably not...)

Yep, it will dedupe across subvolumes for you!
	--Mark

--
Mark Fasheh
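For readers wondering what the Bloom filter buys here, below is a
minimal sketch of the general prefiltering technique in C. It is an
illustration only, not duperemove's actual code; the 2 MiB filter
size, the two probe bits, and the assumption of an 8-byte (or longer)
block digest are arbitrary choices made for the example. The idea is
that a cheap first pass marks which block hashes have possibly been
seen before, so the full hash table only has to hold duplicate
candidates rather than an entry for every block scanned.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOOM_BITS (1u << 24)           /* 16 Mbit filter = 2 MiB of RAM */
static uint8_t bloom[BLOOM_BITS / 8];

/* Derive two probe positions from a block's (already computed) digest,
 * assumed to be at least 8 bytes long. Returns nonzero if the digest
 * has possibly been seen before, i.e. it is a duplicate candidate. */
static int bloom_test_and_set(const uint8_t *digest)
{
	uint32_t h1, h2;

	memcpy(&h1, digest, sizeof(h1));
	memcpy(&h2, digest + 4, sizeof(h2));
	h1 %= BLOOM_BITS;
	h2 %= BLOOM_BITS;

	int seen = ((bloom[h1 / 8] >> (h1 % 8)) & 1) &&
		   ((bloom[h2 / 8] >> (h2 % 8)) & 1);
	bloom[h1 / 8] |= 1u << (h1 % 8);
	bloom[h2 / 8] |= 1u << (h2 % 8);
	return seen;
}

int main(void)
{
	/* Fake digests standing in for per-block hashes from a scan. */
	uint8_t a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
	uint8_t b[8] = { 9, 9, 9, 9, 9, 9, 9, 9 };

	printf("%d\n", bloom_test_and_set(a)); /* 0: not seen before */
	printf("%d\n", bloom_test_and_set(b)); /* 0: not seen before */
	printf("%d\n", bloom_test_and_set(a)); /* 1: possible duplicate */
	return 0;
}

A Bloom filter can produce false positives but never false negatives,
so no duplicate is hidden from the later exact comparison; a unique
block that slips through just costs a little extra memory and one
wasted lookup.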
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Mark Fasheh @ 2015-03-27 20:44 UTC
To: Rich Freeman; +Cc: Hugo Mills, Martin, Btrfs BTRFS

On Tue, Mar 24, 2015 at 09:30:52PM -0400, Rich Freeman wrote:
> On Mon, Mar 23, 2015 at 7:22 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> On Mon, Mar 23, 2015 at 11:10:46PM +0000, Martin wrote:
>>> Does btrfs have dedup (on raid1 multiple disks) that can be enabled?
>>
>> The current state of play is on the wiki:
>>
>> https://btrfs.wiki.kernel.org/index.php/Deduplication
>
> I hadn't realized that bedup was deprecated.
>
> This seems unfortunate, since bedup seemed to be a lot smarter about
> detecting what has and hasn't already been scanned, and it also
> supported defragmenting files while de-duplicating them.

Hi, just FYI: only rescanning files that have changed since the last
scan is a feature I've been working on in duperemove for some time
now. I have some rudimentary code that works, which will be going into
the master branch in a week or so (I wanted to finish it this week,
but other things have kept me busy).

Anyway, that should help with the lack of intelligence about which
files to scan.
	--Mark

--
Mark Fasheh
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Learner Study @ 2015-05-13 16:23 UTC
To: linux-btrfs; +Cc: Learner Study

Hello,

I have been reading about de-duplication and how algorithms such as
Bloom and Cuckoo filters are used for this purpose.

Does BTRFS dedup use any of these, or are there plans to incorporate
them in future?

Thanks for your guidance!
* Re: btrfs dedup - available or experimental? Or yet to be?
From: Zygo Blaxell @ 2015-05-13 21:08 UTC
To: Learner Study; +Cc: linux-btrfs

On Wed, May 13, 2015 at 09:23:25AM -0700, Learner Study wrote:
> I have been reading about de-duplication and how algorithms such as
> Bloom and Cuckoo filters are used for this purpose.
>
> Does BTRFS dedup use any of these, or are there plans to incorporate
> them in future?

btrfs dedup currently lives in user space, and there are multiple
dedup userspace projects in development. Administrators can choose the
dedup tool, and can use multiple dedup tools on the same filesystem.
This is particularly handy if you know something about your data that
a naive hashing algorithm might not (e.g. you have two large trees
derived from a common base, so you can use a much more efficient
algorithm than you would if you knew nothing about the data).

The basic kernel interface for dedup is the extent-same ioctl. A
userspace program creates a list of (fd, length, offset) tuples
referencing identical content and passes them to the kernel. The
kernel locks the file contents, compares them, and replaces identical
data copies with references to a single extent in an atomic operation.
The kernel also provides interfaces to efficiently discover recently
modified extents in a filesystem, enabling deduplicators to follow new
data without the need to block writes.

Most btrfs deduplicators are based on a block-level hash table built
by scanning files, but every other aspect of the tools (e.g. the
mechanism by which files are discovered, block sizes, scalability, use
of prefiltering algorithms such as Bloom filters, whether the hash
table is persistent or ephemeral, etc.) differs from one tool to
another and changes over time as the tools are developed.

Because the kernel interface implies a read of both copies of the
duplicate data, it is not necessary to use a collision-free hash.
Optimizing the number of bits in the hash function for the size of the
filesystem, and exploiting the statistical tendency for identical
blocks to be adjacent to other identical blocks in files, enables
considerable space efficiency in the hash table - possibly so much
that the benefit of Bloom/Cuckoo-style prefiltering becomes
irrelevant. I'm not aware of a released btrfs deduplicator that
currently exploits these optimizations.
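To make the extent-same interface concrete, here is a minimal sketch
of a single dedupe request in C. It assumes kernel headers new enough
to provide BTRFS_IOC_FILE_EXTENT_SAME in <linux/btrfs.h> (roughly 3.12
onwards), that both files live on the same btrfs filesystem, and that
the offsets and length respect the kernel's block-alignment rules;
error handling is kept to a bare minimum, so treat it as an outline of
the call rather than a robust tool.

/* dedupe-one.c: ask btrfs to share the first <length> bytes of <src>
 * with <dst>, but only if the kernel finds the data to be identical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <src> <dst> <length>\n", argv[0]);
		return 1;
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_RDWR);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* The args struct is followed in memory by dest_count
	 * btrfs_ioctl_same_extent_info entries, one per destination. */
	struct btrfs_ioctl_same_args *args =
		calloc(1, sizeof(*args) +
			  sizeof(struct btrfs_ioctl_same_extent_info));
	args->logical_offset = 0;                  /* start of source range */
	args->length = strtoull(argv[3], NULL, 0); /* bytes to compare/share */
	args->dest_count = 1;
	args->info[0].fd = dst;
	args->info[0].logical_offset = 0;          /* start of dest range */

	/* Issued on the *source* fd; the kernel locks both ranges,
	 * byte-compares them, and only then shares the extent. */
	if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0) {
		perror("BTRFS_IOC_FILE_EXTENT_SAME");
		return 1;
	}

	if (args->info[0].status == BTRFS_SAME_DATA_DIFFERS)
		fprintf(stderr, "data differs, nothing deduped\n");
	else if (args->info[0].status < 0)
		fprintf(stderr, "dedupe failed: %s\n",
			strerror(-args->info[0].status));
	else
		printf("deduped %llu bytes\n",
		       (unsigned long long)args->info[0].bytes_deduped);

	free(args);
	return 0;
}

A deduplicator like duperemove builds its candidate list from block
hashes and then issues batches of requests of exactly this shape.
Because the kernel re-reads and byte-compares both ranges before
sharing anything, a hash collision in userspace can waste some I/O but
never corrupt data - which is why, as Zygo notes, a collision-free
hash is unnecessary.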