Re: send/receive and bedup - Konstantinos Skarlatos

From: Konstantinos Skarlatos <k.skarlatos@gmail.com>
To: Mark Fasheh <mfasheh@suse.de>
Cc: Brendan Hide <brendan@swiftspirit.co.za>,
	Scott Middleton <scott@assuretek.com.au>,
	linux-btrfs@vger.kernel.org
Subject: Re: send/receive and bedup
Date: Wed, 21 May 2014 01:56:52 +0300
Message-ID: <537BDDB4.1050406@gmail.com>
In-Reply-To: <20140520223702.GQ27178@wotan.suse.de>

On 21/5/2014 1:37 AM, Mark Fasheh wrote:
> On Tue, May 20, 2014 at 01:07:50AM +0300, Konstantinos Skarlatos wrote:
>>> Duperemove will be shipping as supported software in a major SUSE release so
>>> it will be bug fixed, etc as you would expect. At the moment I'm very busy
>>> trying to fix qgroup bugs so I haven't had much time to add features, or
>>> handle external bug reports, etc. Also I'm not very good at advertising my
>>> software which would be why it hasn't really been mentioned on list lately
>>> :)
>>>
>>> I would say that the state it's in is that I've gotten the feature set to a
>>> point which feels reasonable, and I've fixed enough bugs that I'd appreciate
>>> folks giving it a spin and providing reasonable feedback.
>> Well, after having good results with duperemove with a few gigs of data, I
>> tried it on a 500GB subvolume. After it scanned all files, it has been stuck
>> at 100% of one CPU core for about 5 hours, and still hasn't done any deduping.
>> My CPU is an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, so I guess that's
>> not the problem. So I guess the speed of duperemove drops dramatically as
>> data volume increases.
> Yeah I doubt it's your CPU. Duperemove is right now targeted at smaller data
> sets (a few VMs, ISO images, etc.) than you threw it at, as you undoubtedly
> have figured out. It will need a bit of work before it can handle entire
> file systems. My guess is that it was spending an enormous amount of time
> finding duplicates (it has a very thorough check that could probably be
> optimized).
It finished after 9 or so hours, so I agree it was checking for
duplicates. It does a few GB in just seconds, so run time probably grows
much faster than linearly with data size.
>
> For what it's worth, handling larger data sets is the type of work I want to
> be doing on it in the future.
I can help with testing :)
I would also suggest that you post any changes you make to this list,
so that your program becomes better known among btrfs users. A new
announcement mail or a page on the btrfs wiki would help too.

Finally, I would like to request the ability to do file-level dedup
with a reflink. That has the advantage of consuming very little metadata
compared to block-level dedup. It could be done as a two-pass dedup:
first compare all same-sized files, then run your normal block-level
dedup.

Btw does anybody have a good program/script that can do file level dedup 
with reflinks and checksum comparison?
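
To make the request concrete, here is a minimal sketch of what such a
script could look like. It is not duperemove's method. It assumes GNU cp
with --reflink support and a filesystem that supports reflinks (btrfs,
for example). The function names and the ".reflink-tmp" suffix are made
up for illustration. Files are first bucketed by size, then only the
buckets with more than one file are checksummed, and a duplicate is
replaced by a reflink copy only after a full byte comparison.

```python
import filecmp
import hashlib
import os
import shutil
import subprocess
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_sets(root):
    """Pass 1: bucket regular files by size.
    Pass 2: checksum only the buckets holding more than one file.
    Returns a sorted list of sorted path lists, one per duplicate set."""
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                size = os.path.getsize(path)
                if size > 0:  # empty files hold no data worth sharing
                    by_size[size].append(path)

    duplicate_sets = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        duplicate_sets.extend(sorted(p) for p in by_digest.values() if len(p) > 1)
    return sorted(duplicate_sets)


def reflink_replace(keep, duplicate):
    """Replace `duplicate` with a reflink copy of `keep`.
    Assumption: GNU cp and a reflink-capable filesystem are available.
    Mode and timestamps of `duplicate` are carried over; ownership is not."""
    if not filecmp.cmp(keep, duplicate, shallow=False):
        raise ValueError("files differ, refusing to replace")
    tmp = duplicate + ".reflink-tmp"  # hypothetical temp-name convention
    subprocess.run(["cp", "--reflink=always", "--", keep, tmp], check=True)
    shutil.copystat(duplicate, tmp)
    os.replace(tmp, duplicate)


if __name__ == "__main__":
    import sys

    for group in find_duplicate_sets(sys.argv[1]):
        keep, *rest = group
        for dup in rest:
            reflink_replace(keep, dup)
```

The sketch does not guard against files being modified between the
checksum pass and the replacement, so it should only be run on data that
is not changing.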

Kind regards,
Konstantinos Skarlatos
> 	--Mark
>
> --
> Mark Fasheh
