Re: send/receive and bedup - Konstantinos Skarlatos

From: Konstantinos Skarlatos <k.skarlatos@gmail.com>
To: Mark Fasheh <mfasheh@suse.de>, Brendan Hide <brendan@swiftspirit.co.za>
Cc: Scott Middleton <scott@assuretek.com.au>, linux-btrfs@vger.kernel.org
Subject: Re: send/receive and bedup
Date: Tue, 20 May 2014 01:07:50 +0300
Message-ID: <537A80B6.9080202@gmail.com>
In-Reply-To: <20140519173854.GN27178@wotan.suse.de>

On 19/5/2014 8:38 PM, Mark Fasheh wrote:
> On Mon, May 19, 2014 at 06:01:25PM +0200, Brendan Hide wrote:
>> On 19/05/14 15:00, Scott Middleton wrote:
>>> On 19 May 2014 09:07, Marc MERLIN <marc@merlins.org> wrote:
>>> Thanks for that.
>>>
>>> I may be completely wrong in my approach.
>>>
>>> I am not looking for a file level comparison. Bedup worked fine for
>>> that. I have a lot of virtual images and shadow protect images where
>>> only a few megabytes may be the difference. So a file level hash and
>>> comparison doesn't really achieve my goals.
>>>
>>> I thought duperemove may be on a lower level.
>>>
>>> https://github.com/markfasheh/duperemove
>>>
>>> "Duperemove is a simple tool for finding duplicated extents and
>>> submitting them for deduplication. When given a list of files it will
>>> hash their contents on a block by block basis and compare those hashes
>>> to each other, finding and categorizing extents that match each
>>> other. When given the -d option, duperemove will submit those
>>> extents for deduplication using the btrfs-extent-same ioctl."
>>>
>>> It defaults to 128k but you can make it smaller.
>>>
>>> I hit a hurdle though. The 3TB HDD I used seemed OK when I did a long
>>> SMART test but seems to die every few hours. Admittedly it was part of
>>> a failed mdadm RAID array that I pulled out of a client's machine.
>>>
>>> The only other copy I have of the data is the original mdadm array
>>> that was recently replaced with a new server, so I am loath to use
>>> that HDD yet. At least for another couple of weeks!
>>>
>>>
>>> I am still hopeful duperemove will work.
>> Duperemove does look exactly like what you are looking for. The last
>> traffic on the mailing list regarding that was in August last year. It
>> looks like it was pulled into the main kernel repository on September 1st.
> I'm confused - you need to avoid a file scan completely? Duperemove does do
> that just to be clear.
>
> In your mind, what would be the alternative to that sort of a scan?
>
> By the way, if you know exactly where the changes are you
> could just feed the duplicate extents directly to the ioctl via a script. I
> have a small tool in the duperemove repository that can do that for you
> ('make btrfs-extent-same').
>
>
>> The last commit to the duperemove application was on April 20th this year.
>> Maybe Mark (cc'd) can provide further insight on its current status.
> Duperemove will be shipping as supported software in a major SUSE release so
> it will be bug fixed, etc as you would expect. At the moment I'm very busy
> trying to fix qgroup bugs so I haven't had much time to add features, or
> handle external bug reports, etc. Also I'm not very good at advertising my
> software which would be why it hasn't really been mentioned on list lately
> :)
>
> I would say that the state it's in is that I've gotten the feature set to a
> point which feels reasonable, and I've fixed enough bugs that I'd appreciate
> folks giving it a spin and providing reasonable feedback.
Well, after getting good results from duperemove on a few gigabytes of data,
I tried it on a 500 GB subvolume. After it scanned all the files, it has been
stuck at 100% of one CPU core for about 5 hours and still hasn't done any
deduping. My CPU is an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, so I
guess that's not the problem. It looks like the speed of duperemove drops
dramatically as the data volume increases.
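
To make the "hash block by block and compare the hashes" idea from the
README quote above concrete, here is a rough Python sketch. It is only an
illustration and not duperemove's actual code: the function name
find_duplicate_blocks, the choice of SHA-256, and the plain dictionary
lookup are all assumptions of mine. It only reports matching blocks and
never calls the btrfs-extent-same ioctl.

```python
import hashlib
from collections import defaultdict

# 128 KiB, the default block size mentioned earlier in the thread.
BLOCK_SIZE = 128 * 1024


def find_duplicate_blocks(paths, block_size=BLOCK_SIZE):
    """Hypothetical helper: map each block digest to the (path, offset)
    locations that share it, keeping only digests seen more than once."""
    locations = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(block_size)
                if not block:
                    break
                # SHA-256 is an assumption; any strong hash would do here.
                digest = hashlib.sha256(block).hexdigest()
                locations[digest].append((path, offset))
                offset += len(block)
    # Blocks whose digest appears at two or more locations are candidates.
    return {d: locs for d, locs in locations.items() if len(locs) > 1}
```

Called with a list of image files, it returns a dictionary whose values
are lists of (path, offset) pairs holding identical blocks. Turning those
per-block matches into longer extents is a separate step that this sketch
does not attempt.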

>
> There's a TODO list which gives a decent idea of what's on my mind for
> possible future improvements. I think what I'm most wanting to do right now
> is some sort of (optional) writeout to a file of what was done during a run.
> The idea is that you could feed that data back to duperemove to improve the
> speed of subsequent runs. My priorities may change depending on feedback
> from users of course.
>
> I also at some point want to rewrite some of the duplicate extent finding
> code as it got messy and could be a bit faster.
> 	--Mark
>
> --
> Mark Fasheh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
