To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: About in-band dedupe for v4.7
Date: Fri, 13 May 2016 07:14:06 +0000 (UTC)

Mark Fasheh posted on Thu, 12 May 2016 13:54:26 -0700 as excerpted:

> For example, my 'large' duperemove test involves about 750 gigabytes
> of general purpose data - quite literally /home off my workstation.
>
> After the run I'm usually seeing between 65 and 75 gigabytes saved,
> for a total of only 10% duplicated data.  I would expect this to be
> fairly 'average' - /home on my machine has the usual stuff -
> documents, source code, media, etc.
>
> So if you were writing your whole fs out you could expect about the
> same from inline dedupe - 10%-ish.  Let's be generous and go with
> that number though as a general 'this is how much dedupe we get'.
>
> What the memory backend is doing then is providing a cache of
> sha256/block calculations.  This cache is very expensive to fill,
> and every written block must go through it.  On top of that, the
> cache does not persist between mounts, and has items regularly
> removed from it when we run low on memory.  All of this will drive
> down the amount of duplicated data we can find.
>
> So our best case savings is probably way below 10% - let's be
> _really_ nice and say 5%.

My understanding is that this "general purpose data" use-case isn't
being targeted by the in-memory dedup at all, because indeed it's a
very poor fit, for exactly the reasons you explain.

Instead, think of data centers where perhaps 50% of all files are
duplicated thousands of times over... and where it's exactly those
files that are most frequently used.  That's a totally different
use-case, where that 5% on general purpose data could easily
skyrocket to 50%+.

Refining that a bit, as I understand it, the idea with the in-memory
inline dedup is pretty much opportunity-based dedup: where an easy
opportunity presents itself, grab it, but don't go out of your way to
do anything fancy.  Then, somewhat later, a much more thorough
offline dedup process comes along and dedup-packs everything else.

In that scenario a quick-opportunity 20% hit rate may be acceptable,
while actual hit rates may approach 50% due to the skew toward
commonly duplicated files.  Then the dedup-pack comes along and
finishes the job, possibly resulting in total savings of say 70% or
so.  (On a 750-gig data set like Mark's, that would be very roughly
150 gigs caught inline and over 500 gigs saved once the packer has
run.)
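To make the mechanism concrete, here's a rough userspace sketch of
the sort of bounded in-memory hash backend Mark describes: SHA-256
each written block, look the digest up in a fixed-size table, and
evict the least-recently-used entry when the table fills.  To be
clear, this is just an illustration of the idea, not the actual
patchset code; the names and sizes are invented, and it borrows
OpenSSL's SHA256() so it actually builds (with -lcrypto).

/*
 * Illustrative sketch of a bounded in-memory dedup hash cache: hash
 * each written block, look the digest up in a fixed-size table,
 * evict the least-recently-used entry when the table is full.
 * NOT the btrfs patchset code; names and sizes are invented.
 *
 * Build: cc -o dedup-sketch dedup-sketch.c -lcrypto
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

#define CACHE_SLOTS 4096         /* bounded, so entries get evicted */
#define BLOCK_SIZE (128 * 1024)  /* illustrative dedupe block size  */

struct dedup_entry {
	unsigned char hash[SHA256_DIGEST_LENGTH];
	uint64_t block;      /* where the first copy of the data lives */
	uint64_t last_used;  /* LRU clock for eviction                 */
	int valid;
};

static struct dedup_entry cache[CACHE_SLOTS];
static uint64_t tick;

/*
 * Returns the block number of an already-seen duplicate, or 0 after
 * inserting a new entry (block 0 is reserved to mean "no match", for
 * simplicity).
 */
static uint64_t dedup_lookup_or_insert(const void *data, uint64_t block)
{
	unsigned char h[SHA256_DIGEST_LENGTH];
	struct dedup_entry *victim = &cache[0];
	int i;

	SHA256(data, BLOCK_SIZE, h);

	for (i = 0; i < CACHE_SLOTS; i++) {
		if (cache[i].valid && !memcmp(cache[i].hash, h, sizeof(h))) {
			cache[i].last_used = ++tick;  /* hit: dedupe */
			return cache[i].block;
		}
		if (!cache[i].valid) {
			/* Free slot.  Entries are inserted in order, so
			 * no valid entry can match beyond this point. */
			victim = &cache[i];
			break;
		}
		if (cache[i].last_used < victim->last_used)
			victim = &cache[i];  /* track the LRU victim */
	}

	/* Miss: remember this block.  Once the table is full, this
	 * overwrites the oldest entry, "forgetting" a digest we knew. */
	memcpy(victim->hash, h, sizeof(h));
	victim->block = block;
	victim->last_used = ++tick;
	victim->valid = 1;
	return 0;
}

int main(void)
{
	static char a[BLOCK_SIZE], b[BLOCK_SIZE];

	memset(a, 'x', sizeof(a));
	memset(b, 'x', sizeof(b));

	dedup_lookup_or_insert(a, 1);
	if (dedup_lookup_or_insert(b, 2))
		printf("block 2 deduped against block 1\n");
	return 0;
}

The eviction at the end is the whole story: once the table is full,
every new block forgets an old digest, so a duplicate whose twin has
already been evicted is simply missed.  That's how cache size and
memory pressure (and, since nothing persists, remounts) translate
directly into the lower hit rates Mark describes.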
Even if the in-memory backend doesn't get that common-skew boost and
ends up nearer 20%, that's still a significant savings for the
initial inline result, with the dedup-packer coming along later to
clean things up properly.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman