From: Vivek Goyal
Subject: Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
Date: Tue, 3 Feb 2015 11:17:44 -0500
Message-ID: <20150203161744.GA29525@redhat.com>
References: <53ffb64b.257e320a.6ec4.2b61@mx.google.com> <20150114194315.GA9520@redhat.com> <20150130155639.GA8364@redhat.com>
Reply-To: device-mapper development
To: Vasily Tarasov
Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig, device-mapper development, Philip Shilane, Sonam Mandal, Erez Zadok

On Tue, Feb 03, 2015 at 11:11:07AM -0500, Vasily Tarasov wrote:
> Thanks, Vivek. We'll also start working on adding off-line dedup
> support to Dmdedup.

Ok, thanks Vasily. Let us first review and improve the existing patches
for in-line dedup. Once things are in good shape and ready to be merged,
then you can look at off-line dedupe. Don't want to bloat the size of
patches which contain both in-line and off-line dedupe implementation.

Thanks
Vivek

>
> Vasily
>
> On Fri, Jan 30, 2015 at 10:56 AM, Vivek Goyal wrote:
> > On Fri, Jan 23, 2015 at 11:27:39AM -0500, Vasily Tarasov wrote:
> >
> > [..]
> >> > - Why did you implement an inline deduplication as opposed to out-of-line
> >> > deduplication? Section 2 (Timeliness) in paper just mentioned
> >> > out-of-line dedup but does not go into more details that why did you
> >> > choose an in-line one.
> >> >
> >> > I am wondering that will it not make sense to first implement an
> >> > out-of-line dedup and punt lot of cost to worker thread (which kick
> >> > in only when storage is idle). That way even if don't get a high dedup
> >> > ratio for a workload, inserting a dedup target in the stack will be less
> >> > painful from performance point of view.
> >>
> >> Both in-line and off-line deduplication approaches have their own
> >> pluses and minuses. Among the minuses of the off-line approach is
> >> that it requires allocation of extra space to buffer non-deduplicated
> >> writes,
> >
> > Well, that extra space requirement is temporary. So you got to pay the cost
> > somewhere. Personally, I will be more than happy to consume more disk
> > space when I am writing and not take a hit and let worker threads optimize
> > space usage later.
> >
> >> re-reading the data from disk when deduplication happens (i.e.
> >> more I/O used).
> >
> > Worker threads are supposed to kick in when disk is idle so it might not
> > be as big a concern.
> >
> >> It also complicates space usage accounting and user
> >> might run out of space though deduplication process will discover many
> >> duplicated blocks later.
> >
> > Anyway, user needs to plan for extra space. De-dup is not exact science
> > and one does not know how much will be the de-dup ratio in a data set.
> >
> >>
> >> Our final goal is to support both approaches but for this code
> >> submission we wanted to limit the amount of new code. In-line
> >> deduplication is a core part, around which we can implement off-line
> >> dedup by adding an extra thread that will reuse the same logic as
> >> in-line deduplication.
> >
> > Ok. I am fine with building both if that makes sense.
> >
> > I also understand that there are pros/cons to both the approaches. Just
> > that given the high cost of inline dedupe, I am finding it little odd
> > that it be implemented first as opposed to offline one.
> >
> > Anyway, I will spend some time on patches now.
> >
> > Thanks
> > Vivek
> >